Preface to the Fourth Edition


The fourth edition of this book on Applied Multivariate Statistical Analysis offers
a new sub-chapter on variable selection using the least absolute shrinkage and
selection operator (LASSO) and its generalisation, the so-called Elastic Net.

All pictures and numerical examples have now been calculated in the (almost)
standard languages R and MATLAB. The code for each picture is indicated with
a small Q sign near the picture, e.g. Q MVAdenbank denotes the corresponding
quantlet for reproduction of Fig. 1.9, where we display the densities of the diagonal
of genuine and counterfeit bank notes. We believe that these publicly available
quantlets (see also http://sfb649.wiwi.hu-berlin.de/quantnet/) make a valuable
contribution to the distribution of knowledge in the statistical sciences. The symbols
and notation have also been standardised. In the preparation of the fourth edition, we
received valuable input from Dedy Dwi Prastyo, Petra Burdejova, Sergey Nasekin
and Awdesch Melzer. We would like to thank them.

Berlin, Germany                 Wolfgang Karl Härdle
Louvain-la-Neuve, Belgium       Léopold Simar
January 2014

Preface to the Third Edition


The  third  edition  of  this  book  on  Applied  Multivariate  Statistical  Analysis  offers  the 
following  new  features. 

1.  A  new  Chap.  8  on  Regression  Models  has  been  added. 

2.  Almost  all  numerical  examples  have  been  reproduced  in  MATLAB  or  R. 

The chapter on regression models focuses on a core business of multivariate
statistical analysis. This topic was not discussed prominently in earlier editions
of this book. We now take the opportunity to cover the classical themes of ANOVA
and ANCOVA. Categorical responses are presented in Sect. 8.2. The spectrum of
log-linear models for contingency tables is presented in Sect. 8.2.2, together with
applications to count data, e.g. in the economic and medical sciences. Logit models
are discussed in great detail, and the numerical implementation in terms of matrix
manipulations is presented.

The majority of pictures and numerical examples have now been calculated in the
(almost) standard languages R and MATLAB. The code for each picture is indicated
with a small Q sign near the picture, e.g. Q MVAdenbank denotes the corresponding
quantlet for reproduction of Fig. 1.9, where we display the densities of the diagonal
of genuine and counterfeit bank notes. We believe that these publicly available
quantlets (see also www.quantlet.com) make a valuable contribution to the distribution
of knowledge in the statistical sciences. The symbols and notation have also been
standardised. In the preparation of the third edition, we received valuable input from
Song Song, Weining Wang and Mengmeng Guo. We would like to thank them.

Berlin, Germany                 Wolfgang Karl Härdle
Louvain-la-Neuve, Belgium       Léopold Simar
June 2011

Part I

Descriptive  Techniques 


Chapter  1 

Comparison  of  Batches 


Multivariate  statistical  analysis  is  concerned  with  analysing  and  understanding  data 
in high dimensions. We suppose that we are given a set $\{x_i\}_{i=1}^n$ of $n$ observations
of a variable vector $X$ in $\mathbb{R}^p$. That is, we suppose that each observation $x_i$ has $p$
dimensions:

$$x_i = (x_{i1}, x_{i2}, \ldots, x_{ip}),$$

and that it is an observed value of a variable vector $X \in \mathbb{R}^p$. Therefore, $X$ is
composed of $p$ random variables:

$$X = (X_1, X_2, \ldots, X_p)$$

where $X_j$, for $j = 1, \ldots, p$, is a one-dimensional random variable. How do
we  begin  to  analyse  this  kind  of  data?  Before  we  investigate  questions  on  what 
inferences  we  can  reach  from  the  data,  we  should  think  about  how  to  look  at  the  data. 
This  involves  descriptive  techniques.  Questions  that  we  could  answer  by  descriptive 
techniques  are: 

•  Are  there  components  of  X  that  are  more  spread  out  than  others? 

•  Are  there  some  elements  of  X  that  indicate  sub-groups  of  the  data? 

•  Are  there  outliers  in  the  components  of  X  ? 

•  How  “normal”  is  the  distribution  of  the  data? 

•  Are  there  “low-dimensional”  linear  combinations  of  X  that  show  “non-normal” 
behaviour? 

One  difficulty  of  descriptive  methods  for  high-dimensional  data  is  the  human 
perceptional  system.  Point  clouds  in  two  dimensions  are  easy  to  understand  and  to 
interpret.  With  modern  interactive  computing  techniques  we  have  the  possibility 
to  see  real  time  3D  rotations  and  thus  to  perceive  also  three-dimensional  data. 
A “sliding technique” as described in Härdle and Scott (1992) may give insight
into four-dimensional structures by presenting dynamic 3D density contours as the
fourth variable is changed over its range.

© Springer-Verlag Berlin Heidelberg 2015
W.K. Härdle, L. Simar, Applied Multivariate Statistical Analysis,
DOI 10.1007/978-3-662-45171-7_1

A  qualitative  jump  in  presentation  difficulties  occurs  for  dimensions  greater 
than  or  equal  to  5,  unless  the  high-dimensional  structure  can  be  mapped  into 
lower-dimensional  components  (Klinke  &  Polzehl,  1995).  Features  like  clustered 
sub-groups  or  outliers,  however,  can  be  detected  using  a  purely  graphical  analysis. 

In  this  chapter,  we  investigate  the  basic  descriptive  and  graphical  techniques 
allowing  simple  exploratory  data  analysis.  We  begin  the  exploration  of  a  data 
set  using  boxplots.  A  boxplot  is  a  simple  univariate  device  that  detects  outliers 
component  by  component  and  that  can  compare  distributions  of  the  data  among 
different  groups.  Next,  several  multivariate  techniques  are  introduced  (Flury  faces, 
Andrews’  curves  and  parallel  coordinates  plots  (PCPs))  which  provide  graphical 
displays  addressing  the  questions  formulated  above.  The  advantages  and  the 
disadvantages  of  each  of  these  techniques  are  stressed. 

Two  basic  techniques  for  estimating  densities  are  also  presented:  histograms  and 
kernel  densities.  A  density  estimate  gives  a  quick  insight  into  the  shape  of  the 
distribution  of  the  data.  We  show  that  kernel  density  estimates  (KDEs)  overcome 
some  of  the  drawbacks  of  the  histograms. 

Finally,  scatterplots  are  shown  to  be  very  useful  for  plotting  bivariate  or 
trivariate  variables  against  each  other:  they  help  to  understand  the  nature  of  the 
relationship  among  variables  in  a  data  set  and  allow  for  the  detection  of  groups  or 
clusters  of  points.  Draftman  plots  or  matrix  plots  are  the  visualisation  of  several 
bivariate  scatterplots  on  the  same  display.  They  help  detect  structures  in  conditional 
dependencies  by  brushing  across  the  plots.  Outliers  and  observations  that  need 
special  attention  may  be  discovered  with  Andrews  curves  and  PCPs.  This  chapter 
ends with an exploratory analysis of the Boston Housing data.


1.1  Boxplots 

Example  1.1  The  Swiss  bank  data  (see  Chap.  22,  Sect.  22.2)  consists  of  200 
measurements  on  Swiss  bank  notes.  The  first  half  of  these  measurements  are  from 
genuine  bank  notes,  the  other  half  are  from  counterfeit  bank  notes. 

The  authorities  measured,  as  indicated  in  Fig.  1.1, 

$X_1$ = length of the bill
$X_2$ = height of the bill (left)
$X_3$ = height of the bill (right)
$X_4$ = distance of the inner frame to the lower border
$X_5$ = distance of the inner frame to the upper border
$X_6$ = length of the diagonal of the central picture.

Fig. 1.1 An old Swiss 1000-franc bank note


These  data  are  taken  from  Flury  and  Riedwyl  (1988).  The  aim  is  to  study 
how  these  measurements  may  be  used  in  determining  whether  a  bill  is  genuine  or 
counterfeit. 

The  boxplot  is  a  graphical  technique  that  displays  the  distribution  of  variables.  It 
helps  us  see  the  location,  skewness,  spread,  tail  length  and  outlying  points. 

It  is  particularly  useful  in  comparing  different  batches.  The  boxplot  is  a  graphical 
representation  of  the  Five  Number  Summary.  To  introduce  the  Five  Number 
Summary,  let  us  consider  for  a  moment  a  smaller,  one-dimensional  data  set: 
the  population  of  the  15  largest  world  cities  in  2006  (Table  1.1). 

In the Five Number Summary, we calculate the upper quartile $F_U$, the lower
quartile $F_L$, the median and the extremes. Recall that order statistics
$\{x_{(1)}, x_{(2)}, \ldots, x_{(n)}\}$ are a set of ordered values $x_1, x_2, \ldots, x_n$ where $x_{(1)}$
denotes the minimum and $x_{(n)}$ the maximum. The median $M$ typically cuts the set
of observations in two equal parts, and is defined as

$$M = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & n \text{ odd} \\ \frac{1}{2}\left\{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right\} & n \text{ even.} \end{cases} \qquad (1.1)$$

Table 1.1 The 15 largest world cities in 2006

City          Country       Pop. (10,000)   Order statistics
Tokyo         Japan         3,420           $x_{(15)}$
Mexico City   Mexico        2,280           $x_{(14)}$
Seoul         South Korea   2,230           $x_{(13)}$
New York      USA           2,190           $x_{(12)}$
Sao Paulo     Brazil        2,020           $x_{(11)}$
Bombay        India         1,985           $x_{(10)}$
Delhi         India         1,970           $x_{(9)}$
Shanghai      China         1,815           $x_{(8)}$
Los Angeles   USA           1,800           $x_{(7)}$
Osaka         Japan         1,680           $x_{(6)}$
Jakarta       Indonesia     1,655           $x_{(5)}$
Calcutta      India         1,565           $x_{(4)}$
Cairo         Egypt         1,560           $x_{(3)}$
Manila        Philippines   1,495           $x_{(2)}$
Karachi       Pakistan      1,430           $x_{(1)}$

The quartiles cut the set into four equal parts, which are often called fourths (that is
why we use the letter F). Using a definition that goes back to Hoaglin, Mosteller,
and Tukey (1983) the definition of a median can be generalised to fourths, eighths,
etc. Considering the order statistics we can define the depth of a data value $x_{(i)}$
as $\min\{i, n - i + 1\}$. If $n$ is odd, the depth of the median is $\frac{n+1}{2}$. If $n$ is even,
$\frac{n+1}{2}$ is a fraction. Thus, the median is determined to be the average between
the two data values belonging to the next larger and smaller order statistics, i.e.
$M = \frac{1}{2}\{x_{(n/2)} + x_{(n/2+1)}\}$. In our example, we have $n = 15$, hence the median
$M = x_{(8)}$ = 1,815.

We proceed in the same way to get the fourths. Take the depth of the median and
calculate

$$\text{depth of fourth} = \frac{[\text{depth of median}] + 1}{2}$$

with $[z]$ denoting the largest integer smaller than or equal to $z$. In our example this
gives 4.5 and thus leads to the two fourths

$$F_L = \frac{1}{2}\{x_{(4)} + x_{(5)}\}$$
$$F_U = \frac{1}{2}\{x_{(11)} + x_{(12)}\}$$

(recalling that a depth which is a fraction corresponds to the average of the two
nearest data values).

Table 1.2 Five number summary

#  15      World cities
M  8            1,815
F  4.5     1,610    2,105
   1       1,430    3,420

The F-spread, $d_F$, is defined as $d_F = F_U - F_L$. The outside bars

$$F_U + 1.5\,d_F \qquad (1.2)$$
$$F_L - 1.5\,d_F \qquad (1.3)$$

are the borders beyond which a point is regarded as an outlier. For the number of
points outside these bars see Exercise 1.3. For the $n = 15$ data points the fourths are
$1610 = \frac{1}{2}\{x_{(4)} + x_{(5)}\}$ and $2105 = \frac{1}{2}\{x_{(11)} + x_{(12)}\}$. Therefore the F-spread and
the upper and lower outside bars in the above example are calculated as follows:

$$d_F = F_U - F_L = 2105 - 1610 = 495 \qquad (1.4)$$
$$F_L - 1.5\,d_F = 1610 - 1.5 \cdot 495 = 867.5 \qquad (1.5)$$
$$F_U + 1.5\,d_F = 2105 + 1.5 \cdot 495 = 2847.5. \qquad (1.6)$$

Since Tokyo is beyond the outside bars it is considered to be an outlier. The
minimum and the maximum are called the extremes. The mean is defined as

$$\bar{x} = n^{-1}\sum_{i=1}^{n} x_i,$$

which is 1,939.7 in our example. The mean is a measure of location. The median
(1815), the fourths (1610; 2105) and the extremes (1430; 3420) constitute basic
information about the data. The combination of these five numbers leads to the Five
Number Summary as shown in Table 1.2. The depths of each of the five numbers
have been added as an additional column.
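The depth-based calculation of the Five Number Summary can be retraced with a few lines of code. The following is a minimal sketch in Python, not one of the book's quantlets; the function names `value_at_depth`, `five_number_summary` and `outside_bars` are our own illustrative choices, and the data are the city populations of Table 1.1.

```python
import math

# Populations (in units of 10,000) of the 15 cities in Table 1.1.
POP = [3420, 2280, 2230, 2190, 2020, 1985, 1970, 1815,
       1800, 1680, 1655, 1565, 1560, 1495, 1430]


def value_at_depth(ordered, depth, from_top=False):
    """Order statistic at a (possibly fractional) depth.

    A fractional depth such as 4.5 is read as the average of the two
    neighbouring order statistics, as described in the text.
    """
    if from_top:
        ordered = ordered[::-1]
    lo = math.floor(depth)
    hi = math.ceil(depth)
    return 0.5 * (ordered[lo - 1] + ordered[hi - 1])


def five_number_summary(data):
    """Median, fourths and extremes together with their depths."""
    x = sorted(data)
    n = len(x)
    depth_median = (n + 1) / 2
    depth_fourth = (math.floor(depth_median) + 1) / 2
    return {
        "M": value_at_depth(x, depth_median),
        "F_L": value_at_depth(x, depth_fourth),
        "F_U": value_at_depth(x, depth_fourth, from_top=True),
        "min": x[0],
        "max": x[-1],
        "depth_M": depth_median,
        "depth_F": depth_fourth,
    }


def outside_bars(summary):
    """Lower and upper outside bars F_L - 1.5 d_F and F_U + 1.5 d_F."""
    d_f = summary["F_U"] - summary["F_L"]
    return summary["F_L"] - 1.5 * d_f, summary["F_U"] + 1.5 * d_f


summary = five_number_summary(POP)
lower, upper = outside_bars(summary)
outliers = [p for p in POP if p < lower or p > upper]
print(summary, (lower, upper), outliers)
```

For the city data the sketch returns the median 1815, the fourths 1610 and 2105, the outside bars 867.5 and 2847.5, and Tokyo (3420) as the only point beyond the bars.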


Construction  of  the  Boxplot 

1. Draw a box with borders (edges) at $F_L$ and $F_U$ (i.e. 50 % of the data are in this
box).

2. Draw the median as a solid line (|) and the mean as a dotted line (¦).

3. Draw “whiskers” from each end of the box to the most remote point that is NOT
an outlier.

4. Show outliers as either “★” or “•” depending on whether they are outside of
$F_{UL} \pm 1.5\,d_F$ or $F_{UL} \pm 3\,d_F$ respectively (this feature is not contained in some
software). Label them if possible.
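The construction steps can be sketched numerically before anything is drawn. The Python function below is an illustration under our own naming (`boxplot_elements` is hypothetical, not a quantlet); it takes the fourths as given and returns the box, the whisker ends, and the two classes of outliers from step 4.

```python
# Populations (in units of 10,000) of the 15 cities in Table 1.1.
POP = [3420, 2280, 2230, 2190, 2020, 1985, 1970, 1815,
       1800, 1680, 1655, 1565, 1560, 1495, 1430]


def boxplot_elements(data, f_l, f_u):
    """Box edges, whisker ends and outliers for given fourths f_l, f_u."""
    d_f = f_u - f_l
    inner = (f_l - 1.5 * d_f, f_u + 1.5 * d_f)   # outside bars
    outer = (f_l - 3.0 * d_f, f_u + 3.0 * d_f)   # bars for far-out points
    inside = [x for x in data if inner[0] <= x <= inner[1]]
    beyond_inner = [x for x in data if not inner[0] <= x <= inner[1]]
    return {
        "box": (f_l, f_u),
        # whiskers reach the most remote points that are not outliers
        "whiskers": (min(inside), max(inside)),
        # outside the 1.5 d_F bars but within the 3 d_F bars
        "outside_1.5": sorted(x for x in beyond_inner
                              if outer[0] <= x <= outer[1]),
        # outside the 3 d_F bars
        "outside_3": sorted(x for x in beyond_inner
                            if not outer[0] <= x <= outer[1]),
    }


elements = boxplot_elements(POP, f_l=1610, f_u=2105)
print(elements)
```

With the fourths 1610 and 2105 the whiskers end at 1430 (Karachi) and 2280 (Mexico City), and Tokyo lies outside the $1.5\,d_F$ bars but inside the $3\,d_F$ bars.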
Fig. 1.2 Boxplot for world cities Q MVAboxcity


In the world cities example, the cut-off points (outside bars) are at 867.5 and
2847.5, hence we can draw whiskers to Karachi and Mexico City. We can see from
Fig. 1.2 that the data are very skewed: the upper half of the data (above the median)
is more spread out than the lower half (below the median). The data contain one
outlier, marked as a circle, and the mean (as a non-robust measure of location) is
pulled away from the median.

Boxplots  are  very  useful  tools  in  comparing  batches.  The  relative  location  of 
the  distribution  of  different  batches  tells  us  a  lot  about  the  batches  themselves. 
Before  we  come  back  to  the  Swiss  bank  data,  let  us  compare  the  fuel  economy 
of  vehicles  from  different  countries,  see  Fig.  1.3  and  Table  22.3. 

Example  1.2  The  data  are  from  the  second  column  of  Table  22.3  and  show 
the  mileage  (miles  per  gallon)  of  American,  Japanese  and  European  cars. 
The  five-number  summaries  for  these  data  sets  are  {12,16.8,18.8,22,30}, 
{18, 22, 25,  30.5,  35}  and  {14, 19, 23,  25,  28}  for  American,  Japanese  and  European 
cars,  respectively.  This  reflects  the  information  shown  in  Fig.  1.3.  The  following 
conclusions  can  be  made: 

•  Japanese  cars  achieve  higher  fuel  efficiency  than  US  and  European  cars. 

•  There  is  one  outlier,  a  very  fuel-efficient  car  (VW-Rabbit  Golf  Diesel). 

•  The  main  body  of  the  US  car  data  (the  box)  lies  below  the  Japanese  car  data. 

•  The  worst  Japanese  car  is  more  fuel-efficient  than  almost  50  %  of  the  US  cars. 

• The spread of the Japanese and the US cars is almost equal.

•  The  median  of  the  Japanese  data  is  above  that  of  the  European  data  and  the  US 
data. 


Fig. 1.3 Boxplot for the mileage of American, Japanese and European cars
(from left to right) Q MVAboxcar

Fig. 1.4 The $X_6$ variable of Swiss bank data (diagonal of bank notes) Q MVAboxbank6

Table 1.3 Five number summary

#  100     Genuine bank notes
M  50.5         141.5
F  25.75   141.25   141.8
   1       140.65   142.4

Now let us apply the boxplot technique to the bank data set. In Fig. 1.4 we
show the parallel boxplot of the diagonal variable $X_6$. On the left is the value of
the genuine bank notes and on the right the value of the counterfeit bank notes. The
five number summaries are reported in Tables 1.3 and 1.4.

Table 1.4 Five number summary

#  100     Counterfeit bank notes
M  50.5         139.5
F  25.75   139.2    139.8
   1       138.3    140.65

Fig. 1.5 The $X_1$ variable of Swiss bank data (length of bank notes) Q MVAboxbank1


One  sees  that  the  diagonals  of  the  genuine  bank  notes  tend  to  be  larger.  It  is 
harder to see a clear distinction when comparing the length of the bank notes $X_1$,
see  Fig.  1.5.  There  are  a  few  outliers  in  both  plots.  Almost  all  the  observations  of 
the  diagonal  of  the  genuine  notes  are  above  the  ones  from  the  counterfeit  notes. 
There  is  one  observation  in  Fig.  1.4  of  the  genuine  notes  that  is  almost  equal  to 
the  median  of  the  counterfeit  notes.  Can  the  parallel  boxplot  technique  help  us 
distinguish  between  the  two  types  of  bank  notes? 


Summary

• The median and mean bars are measures of location.

• The relative location of the median (and the mean) in the box is a
measure of skewness.

• The length of the box and whiskers is a measure of spread.

• The length of the whiskers indicates the tail length of the distribution.

• The outlying points are indicated with a “★” or “•” depending on
whether they are outside of $F_{UL} \pm 1.5\,d_F$ or $F_{UL} \pm 3\,d_F$ respectively.

• The boxplots do not indicate multi-modality or clusters.

• If we compare the relative size and location of the boxes, we are
comparing distributions.


1.2  Histograms 

Histograms  are  density  estimates.  A  density  estimate  gives  a  good  impression  of  the 
distribution  of  the  data.  In  contrast  to  boxplots,  density  estimates  show  possible 
multimodality  of  the  data.  The  idea  is  to  locally  represent  the  data  density  by 
counting  the  number  of  observations  in  a  sequence  of  consecutive  intervals  (bins) 
with origin $x_0$. Let $B_j(x_0, h)$ denote the bin of length $h$ which is the element of a
bin grid starting at $x_0$:

$$B_j(x_0, h) = [x_0 + (j-1)h,\; x_0 + jh), \quad j \in \mathbb{Z},$$

where $[\cdot,\cdot)$ denotes a left closed and right open interval. If $\{x_i\}_{i=1}^n$ is an i.i.d. sample
with density $f$, the histogram is defined as follows:

$$\hat{f}_h(x) = n^{-1}h^{-1}\sum_{j \in \mathbb{Z}}\sum_{i=1}^{n} I\{x_i \in B_j(x_0, h)\}\, I\{x \in B_j(x_0, h)\}. \qquad (1.7)$$

In sum (1.7) the first indicator function $I\{x_i \in B_j(x_0, h)\}$ (see Symbols and
Notation in Chap. 21) counts the number of observations falling into bin $B_j(x_0, h)$.
The second indicator function is responsible for “localising” the counts around $x$.
The  parameter  h  is  a  smoothing  or  localising  parameter  and  controls  the  width  of 
the  histogram  bins.  An  h  that  is  too  large  leads  to  very  big  blocks  and  thus  to  a 
very  unstructured  histogram.  On  the  other  hand,  an  h  that  is  too  small  gives  a  very 
variable  estimate  with  many  unimportant  peaks. 
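Definition (1.7) translates almost literally into code: find the bin that contains $x$, count the observations in it, and divide by $nh$. The sketch below is a plain Python illustration with a hypothetical function name `histogram_density`. Since the bank note measurements are not reproduced in this chapter, it reuses the city populations of Table 1.1 as a small sample.

```python
import math

# City populations from Table 1.1 (units of 10,000), reused as a small sample.
POP = [3420, 2280, 2230, 2190, 2020, 1985, 1970, 1815,
       1800, 1680, 1655, 1565, 1560, 1495, 1430]


def histogram_density(x, data, x0, h):
    """Histogram estimate (1.7) at the point x for origin x0 and binwidth h."""
    # x lies in the bin B_j = [x0 + (j - 1) h, x0 + j h)
    j = math.floor((x - x0) / h) + 1
    left, right = x0 + (j - 1) * h, x0 + j * h
    count = sum(1 for xi in data if left <= xi < right)
    return count / (len(data) * h)


# Same point and binwidth, two different origins:
print(histogram_density(1500, POP, x0=1400, h=400))  # 6 of 15 points in [1400, 1800)
print(histogram_density(1500, POP, x0=1200, h=400))  # 4 of 15 points in [1200, 1600)
```

The two calls differ only in the origin $x_0$, yet they give different values at the same point, which anticipates the origin problem discussed below.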

The effect of $h$ is given in detail in Fig. 1.6. It contains the histogram (upper
left) for the diagonal of the counterfeit bank notes for $x_0 = 137.8$ (the minimum
of these observations) and $h = 0.1$. Increasing $h$ to $h = 0.2$ and using the same
origin, $x_0 = 137.8$, results in the histogram shown in the lower left of the figure.
This density histogram is somewhat smoother due to the larger $h$. The binwidth is
next set to $h = 0.3$ (upper right). From this histogram, one has the impression that
the distribution of the diagonal is bimodal with peaks at about 138.5 and 139.9.

Fig. 1.6 Diagonal of counterfeit bank notes. Histograms with $x_0 = 137.8$ and $h = 0.1$ (upper
left), $h = 0.2$ (lower left), $h = 0.3$ (upper right), $h = 0.4$ (lower right) Q MVAhisbank1


The detection of modes requires fine tuning of the bin width. Using methods from
smoothing methodology (Härdle, Müller, Sperlich, & Werwatz, 2004) one can find
an “optimal” binwidth $h$ for $n$ observations:

$$h_{opt} = \left(\frac{24\sqrt{\pi}}{n}\right)^{1/3}.$$

Unfortunately, the binwidth $h$ is not the only parameter determining the shape of $\hat{f}$.

In Fig. 1.7, we show histograms with $x_0 = 137.65$ (upper left), $x_0 = 137.75$
(lower left), $x_0 = 137.85$ (upper right), and $x_0 = 137.95$ (lower right). All
the graphs have been scaled equally on the y-axis to allow comparison. One sees
that, despite the fixed binwidth $h$, the interpretation is not facilitated. The shift
of the origin $x_0$ (to four different locations) created four different histograms. This
property of histograms strongly contradicts the goal of presenting data features.
Obviously, the same data are represented quite differently by the four histograms. A
remedy has been proposed by Scott (1985): “Average the shifted histograms!”. The
result is presented in Fig. 1.8.

Fig. 1.7 Diagonal of counterfeit bank notes. Histogram with $h = 0.4$ and origins $x_0 = 137.65$
(upper left), $x_0 = 137.75$ (lower left), $x_0 = 137.85$ (upper right), $x_0 = 137.95$ (lower right) Q
MVAhisbank2

Here  all  bank  note  observations  (genuine  and  counterfeit)  have  been  used.  The 
(so-called)  averaged  shifted  histogram  is  no  longer  dependent  on  the  origin  and 
shows  a  clear  bimodality  of  the  diagonals  of  the  Swiss  bank  notes. 
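The averaging idea can be sketched in a few lines: compute several histograms with the same binwidth $h$ whose origins are shifted by $h/m$, and average them pointwise. The Python code below is an illustration only; the function names are hypothetical, the equally spaced shifts are our reading of the idea, and the city populations of Table 1.1 stand in for the bank note data, which are not listed in this chapter.

```python
import math

# City populations from Table 1.1 (units of 10,000), used as a stand-in sample.
POP = [3420, 2280, 2230, 2190, 2020, 1985, 1970, 1815,
       1800, 1680, 1655, 1565, 1560, 1495, 1430]


def histogram_density(x, data, x0, h):
    """Histogram estimate at x for origin x0 and binwidth h."""
    j = math.floor((x - x0) / h) + 1
    left, right = x0 + (j - 1) * h, x0 + j * h
    count = sum(1 for xi in data if left <= xi < right)
    return count / (len(data) * h)


def averaged_shifted_histogram(x, data, x0, h, shifts):
    """Average of `shifts` histograms with origins x0 + k * h / shifts."""
    total = 0.0
    for k in range(shifts):
        total += histogram_density(x, data, x0 + k * h / shifts, h)
    return total / shifts


# Two shifts: origins 1400 and 1600, binwidth 400.
print(averaged_shifted_histogram(1500, POP, x0=1400, h=400, shifts=2))
```

At $x = 1500$ the two shifted histograms contain 6 and 4 observations in the relevant bin, so the average corresponds to 5 observations per bin. Each component histogram integrates to one, hence so does the average.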
Fig. 1.8 Averaged shifted histograms based on all (counterfeit and genuine) Swiss bank notes:
there are 2 shifts (upper left), 4 shifts (lower left), 8 shifts (upper right) and 16 shifts (lower right)
Q MVAashbank


Summary

• Modes of the density are detected with a histogram.

• Modes correspond to strong peaks in the histogram.

• Histograms with the same $h$ need not be identical. They also
depend on the origin $x_0$ of the grid.

• The influence of the origin $x_0$ is drastic. Changing $x_0$ creates
different looking histograms.

• The consequence of an $h$ that is too large is an unstructured
histogram that is too flat.

• A binwidth $h$ that is too small results in an unstable histogram.

• There is an “optimal” $h = (24\sqrt{\pi}/n)^{1/3}$.

• It is recommended to use averaged histograms. They are kernel
densities.


1.3  Kernel  Densities 


The major difficulties of histogram estimation may be summarised in four critiques:

• determination of the bin width $h$, which controls the shape of the histogram,

• choice of the bin origin $x_0$, which also influences to some extent the shape,

• loss of information since observations are replaced by the central point of the
interval in which they fall,

• the underlying density function is often assumed to be smooth, but the histogram
is not smooth.

Rosenblatt  (1956),  Whittle  (1958)  and  Parzen  (1962)  developed  an  approach 
which  avoids  the  last  three  difficulties.  First,  a  smooth  kernel  function  rather  than 
a  box  is  used  as  the  basic  building  block.  Second,  the  smooth  function  is  centred 
directly  over  each  observation.  Let  us  study  this  refinement  by  supposing  that  x  is 
the  centre  value  of  a  bin.  The  histogram  can  in  fact  be  rewritten  as 


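$$\hat{f}_h(x) = n^{-1}h^{-1}\sum_{i=1}^{n} I\left(|x - x_i| \le \frac{h}{2}\right). \qquad (1.8)$$

If we define $K(u) = I(|u| \le \frac{1}{2})$, then (1.8) changes to

$$\hat{f}_h(x) = n^{-1}h^{-1}\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right). \qquad (1.9)$$

This is the general form of the kernel estimator. Allowing smoother kernel functions
like the quartic kernel,

$$K(u) = \frac{15}{16}(1 - u^2)^2\, I(|u| \le 1),$$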


and  computing  x  not  only  at  bin  centers  gives  us  the  kernel  density  estimator. 
Kernel  estimators  can  also  be  derived  via  weighted  averaging  of  rounded  points 
(WARPing)  or  by  averaging  histograms  with  different  origins,  see  Scott  (1985). 
Table  1.5  introduces  some  commonly  used  kernels. 
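The step from the histogram to the kernel estimator can be checked numerically: at a bin centre, the bin count divided by $nh$ agrees with the kernel form when the box kernel $K(u) = I(|u| \le 1/2)$ is used, provided no observation sits exactly on a bin edge. The Python sketch below is an illustration with hypothetical function names and again uses the city populations of Table 1.1 as a small sample.

```python
# City populations from Table 1.1 (units of 10,000), used as a small sample.
POP = [3420, 2280, 2230, 2190, 2020, 1985, 1970, 1815,
       1800, 1680, 1655, 1565, 1560, 1495, 1430]


def box_kernel(u):
    """K(u) = I(|u| <= 1/2)."""
    return 1.0 if abs(u) <= 0.5 else 0.0


def kernel_estimate(x, data, h, kernel):
    """General kernel form: (n h)^(-1) * sum of K((x - x_i) / h)."""
    return sum(kernel((x - xi) / h) for xi in data) / (len(data) * h)


def bin_count_estimate(centre, data, h):
    """Histogram value for the bin [centre - h/2, centre + h/2)."""
    count = sum(1 for xi in data if centre - h / 2 <= xi < centre + h / 2)
    return count / (len(data) * h)


centre, h = 1610, 400  # bin [1410, 1810); no observation lies on an edge
print(bin_count_estimate(centre, POP, h), kernel_estimate(centre, POP, h, box_kernel))
```

Both calls count the same seven observations. Unlike the histogram, `kernel_estimate` can be evaluated at any $x$, not only at bin centres, and accepts any other kernel function in place of `box_kernel`.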
Table 1.5 Kernel functions

$K(\bullet)$                                                       Kernel
$K(u) = \frac{1}{2} I(|u| \le 1)$                                  Uniform
$K(u) = (1 - |u|) I(|u| \le 1)$                                    Triangle
$K(u) = \frac{3}{4}(1 - u^2) I(|u| \le 1)$                         Epanechnikov
$K(u) = \frac{15}{16}(1 - u^2)^2 I(|u| \le 1)$                     Quartic (Biweight)
$K(u) = \frac{1}{\sqrt{2\pi}} \exp(-\frac{u^2}{2}) = \varphi(u)$   Gaussian
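Each kernel in Table 1.5 is itself a density, so it should integrate to one. The following Python sketch writes the five kernels as plain functions and checks this with a simple midpoint rule; the dictionary `KERNELS` and the helper `integrate` are our own illustrative names.

```python
import math


def uniform(u):
    return 0.5 if abs(u) <= 1 else 0.0


def triangle(u):
    return 1.0 - abs(u) if abs(u) <= 1 else 0.0


def epanechnikov(u):
    return 0.75 * (1.0 - u * u) if abs(u) <= 1 else 0.0


def quartic(u):
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) <= 1 else 0.0


def gaussian(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


KERNELS = {
    "uniform": uniform,
    "triangle": triangle,
    "epanechnikov": epanechnikov,
    "quartic": quartic,
    "gaussian": gaussian,
}


def integrate(f, a, b, steps=20000):
    """Midpoint rule on [a, b]; accurate enough for a sanity check."""
    width = (b - a) / steps
    return sum(f(a + (k + 0.5) * width) for k in range(steps)) * width


for name, kernel in KERNELS.items():
    print(name, round(integrate(kernel, -8.0, 8.0), 4))
```

All five integrals come out at one up to the error of the numerical rule. The interval $[-8, 8]$ is chosen generously so that the Gaussian tails outside it are negligible.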

Different kernels generate different shapes of the estimated density. The most
important parameter is the so-called bandwidth $h$, which can be optimised, for
example, by cross-validation; see Härdle (1991) for details. The cross-validation method
minimises the integrated squared error. This measure of discrepancy is based on
the squared differences $\{\hat{f}_h(x) - f(x)\}^2$. Averaging these squared deviations over
a grid of points $\{x_l\}_{l=1}^{L}$ leads to

$$L^{-1}\sum_{l=1}^{L} \{\hat{f}_h(x_l) - f(x_l)\}^2.$$

Asymptotically, if this grid size tends to zero, we obtain the integrated squared error:

$$\int \{\hat{f}_h(x) - f(x)\}^2\, dx.$$

In practice, it turns out that the method consists of selecting a bandwidth that
minimises the cross-validation function

$$\int \hat{f}_h^2 - 2\sum_{i=1}^{n} \hat{f}_{h,i}(x_i)$$

where $\hat{f}_{h,i}$ is the density estimate obtained by using all datapoints except for the $i$-th
observation. Both terms in the above function involve double sums. Computation
may therefore be slow. There are many other density bandwidth selection methods.
Probably the fastest way to calculate this is to refer to some reasonable reference
distribution. The idea of using the Normal distribution as a reference, for example,
goes back to Silverman (1986). The resulting choice of $h$ is called the rule of thumb.
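To make the double sums concrete, here is a minimal Python sketch of least-squares cross-validation for the Gaussian kernel. Several choices are assumptions of this sketch rather than statements of the text: the leave-one-out estimate is normalised by $n - 1$, the second term carries the factor $2/n$ (a common normalisation that may differ by constants from the display above), the integral of $\hat{f}_h^2$ uses the closed form available for Gaussian kernels, and the function names are hypothetical. The city populations of Table 1.1 serve as a small sample.

```python
import math

# City populations from Table 1.1 (units of 10,000), used as a small sample.
POP = [3420, 2280, 2230, 2190, 2020, 1985, 1970, 1815,
       1800, 1680, 1655, 1565, 1560, 1495, 1430]


def normal_pdf(u, s):
    """Density of N(0, s^2) at u."""
    return math.exp(-0.5 * (u / s) ** 2) / (s * math.sqrt(2.0 * math.pi))


def cv_score(data, h):
    """Cross-validation score for a Gaussian-kernel estimate with bandwidth h."""
    n = len(data)
    # Integral of f_hat^2: for Gaussian kernels the product of two kernels
    # integrates to a normal density with standard deviation sqrt(2) * h.
    int_f2 = sum(normal_pdf(xi - xj, math.sqrt(2.0) * h)
                 for xi in data for xj in data) / n ** 2
    # Leave-one-out estimates evaluated at the omitted observation.
    loo = 0.0
    for i, xi in enumerate(data):
        loo += sum(normal_pdf(xi - xj, h)
                   for j, xj in enumerate(data) if j != i) / (n - 1)
    return int_f2 - 2.0 * loo / n


def select_bandwidth(data, grid):
    """Bandwidth on the grid with the smallest cross-validation score."""
    return min(grid, key=lambda h: cv_score(data, h))


grid = [25.0 * k for k in range(1, 41)]   # candidate bandwidths 25, 50, ..., 1000
print(select_bandwidth(POP, grid))
```

Both terms are double sums over the data, so one evaluation costs on the order of $n^2$ kernel evaluations, and this is repeated for every candidate bandwidth. That is the reason the text calls the computation potentially slow.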

For the Gaussian kernel from Table 1.5 and a Normal reference distribution, the
rule of thumb is to choose

$$h_G = 1.06\,\hat{\sigma}\,n^{-1/5} \qquad (1.10)$$

where $\hat{\sigma} = \sqrt{n^{-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$ denotes the sample standard deviation. This
choice of $h_G$ optimises the integrated squared distance between the estimator and
the true density. For the quartic kernel, we need to transform (1.10). The modified
rule of thumb is:

$$h_Q = 2.62 \cdot h_G. \qquad (1.11)$$

Fig. 1.9 Densities of the diagonals of genuine and counterfeit bank notes.
Automatic density estimates Q MVAdenbank
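The rule-of-thumb bandwidths (1.10) and (1.11) need only the sample size and the sample standard deviation. A short Python sketch follows; the function name `rule_of_thumb` is our own, and the standard deviation uses the divisor $n$ as in the definition of $\hat{\sigma}$.

```python
import math


def rule_of_thumb(data):
    """Return (h_G, h_Q): Gaussian rule of thumb and its quartic counterpart."""
    n = len(data)
    mean = sum(data) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in data) / n)  # divisor n
    h_gauss = 1.06 * sigma * n ** (-1.0 / 5.0)
    h_quartic = 2.62 * h_gauss
    return h_gauss, h_quartic


# City populations from Table 1.1 (units of 10,000), used as a small sample.
POP = [3420, 2280, 2230, 2190, 2020, 1985, 1970, 1815,
       1800, 1680, 1655, 1565, 1560, 1495, 1430]
print(rule_of_thumb(POP))
```

Because the rule rests on a Normal reference distribution, it can oversmooth data that are strongly skewed or multimodal; in such cases it is best read as a starting value.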

Figure 1.9 shows the automatic density estimates for the diagonals of the
counterfeit and genuine bank notes. The density on the left is the density corresponding
to the diagonal of the counterfeit data. The separation is clearly visible, but there is
also an overlap. The problem of distinguishing between the counterfeit and genuine
bank notes is not solved by just looking at the diagonals of the notes. The question
arises whether a better separation could be achieved using not only the diagonals,
but one or two more variables of the data set. The estimation of higher dimensional
densities is analogous to that of one dimensional. We show a two-dimensional
density estimate for $X_4$ and $X_5$ in Fig. 1.10. The contour lines indicate the height
of the density. One sees two separate distributions in this higher dimensional space,
but they still overlap to some extent.

We can add one more dimension and give a graphical representation of a
three-dimensional density estimate, or more precisely an estimate of the joint distribution
of $X_4$, $X_5$ and $X_6$. Figure 1.11 shows the contour areas at three different levels of
the density: 0.2 (green), 0.4 (red) and 0.6 (blue) of this three-dimensional density
estimate. One can clearly recognise two “ellipsoids” (at each level), but as before,
they overlap. In Chap. 14 we will learn how to separate the two ellipsoids and how
to develop a discrimination rule to distinguish between these data points.

Fig. 1.10 Contours of the density of $X_5$ and $X_6$ of genuine and counterfeit bank
notes Q MVAcontbank2

Fig. 1.11 Contours of the density of $X_4$, $X_5$, $X_6$ of genuine and counterfeit bank
notes Q MVAcontbank3

Summary

• Kernel densities estimate distribution densities by the kernel
method.

• The bandwidth $h$ determines the degree of smoothness of the
estimate $\hat{f}$.

• Kernel densities are smooth functions and they can graphically
represent distributions (up to three dimensions).

• A simple (but not necessarily correct) way to find a good bandwidth
is to compute the rule of thumb bandwidth $h_G = 1.06\,\hat{\sigma}\,n^{-1/5}$.
This bandwidth is to be used only in combination with a Gaussian
kernel $\varphi$.

• Kernel density estimates are a good descriptive tool for seeing
modes, location, skewness, tails, asymmetry, etc.


1.4  Scatterplots 

Scatterplots  are  bivariate  or  trivariate  plots  of  variables  against  each  other.  They  help 
us understand relationships among the variables of a data set. A downward-sloping
scatter indicates that as we increase the variable on the horizontal axis, the variable
on the vertical axis decreases. An analogous statement can be made for upward-sloping
scatters.

Figure  1.12  plots  the  5th  column  (upper  inner  frame)  of  the  bank  data  against 
the  6th  column  (diagonal).  The  scatter  is  downward-sloping.  As  we  already  know 
from  the  previous  section  on  marginal  comparison  (e.g.  Fig.  1.9)  a  good  separation 
between  genuine  and  counterfeit  bank  notes  is  visible  for  the  diagonal  variable. 
The  sub-cloud  in  the  upper  half  (circles)  of  Fig.  1.12  corresponds  to  the  true  bank 
notes.  As  noted  before,  this  separation  is  not  distinct,  since  the  two  groups  overlap 
somewhat. 

This  can  be  verified  in  an  interactive  computing  environment  by  showing  the 
index  and  coordinates  of  certain  points  in  this  scatterplot.  In  Fig.  1.12,  the  70th 
observation  in  the  merged  data  set  is  given  as  a  thick  circle,  and  it  is  from  a  genuine 
bank  note.  This  observation  lies  well  embedded  in  the  cloud  of  counterfeit  bank 
notes.  One  straightforward  approach  that  could  be  used  to  tell  the  counterfeit  from 
the  genuine  bank  notes  is  to  draw  a  straight  line  and  define  notes  above  this  value  as 
genuine.  We  would  of  course  misclassify  the  70th  observation,  but  can  we  do  better? 


Fig. 1.12 2D scatterplot for X5 vs. X6 of the bank notes. Genuine notes are circles, counterfeit notes are stars Q MVAscabank56

Fig. 1.13 3D scatterplot of the bank notes for (X4, X5, X6). Genuine notes are circles, counterfeit notes are stars Q MVAscabank456


If  we  extend  the  two-dimensional  scatterplot  by  adding  a  third  variable,  e.g.  X4 
(lower  distance  to  inner  frame),  we  obtain  the  scatterplot  in  three  dimensions  as 
shown  in  Fig.  1.13.  It  becomes  apparent  from  the  location  of  the  point  clouds  that  a 
better  separation  is  obtained.  We  have  rotated  the  three-dimensional  data  until  this 
satisfactory  3D  view  was  obtained.  Later,  we  will  see  that  the  rotation  is  the  same 
as bundling a high-dimensional observation into one or more linear combinations of the elements of the observation vector. In other words, the “separation line” parallel to the horizontal coordinate axis in Fig. 1.12 is, in Fig. 1.13, a plane and no longer parallel to one of the axes. The formula for such a separation plane is a linear combination of the elements of the observation vector:

$$a_1 X_1 + a_2 X_2 + \cdots + a_6 X_6 = \text{const.} \tag{1.12}$$

The algorithm that automatically finds the weights $(a_1, \ldots, a_6)$ will be investigated later on in Chap. 14.

Fig. 1.14 Draftman’s plot of the bank notes. The pictures in the left-hand column show (X3, X4), (X3, X5) and (X3, X6), in the middle we have (X4, X5) and (X4, X6), and in the lower right (X5, X6). The upper right half contains the corresponding density contour plots Q MVAdrafbank4
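
To make the separation rule (1.12) concrete, here is a minimal Python sketch. The weights and the constant are hypothetical and chosen only for illustration (they are not the output of the algorithm of Chap. 14); the observation is the 96th bank note as quoted in Example 1.3.

```python
def linear_score(weights, observation):
    """Linear combination a_1 X_1 + ... + a_p X_p of one observation."""
    if len(weights) != len(observation):
        raise ValueError("weights and observation must have the same length")
    return sum(a * x for a, x in zip(weights, observation))


def classify(weights, observation, const):
    """Label an observation by the side of the plane (1.12) it falls on."""
    score = linear_score(weights, observation)
    return "genuine" if score > const else "counterfeit"


# 96th observation of the Swiss bank note data, as quoted in Example 1.3.
x96 = (215.6, 129.9, 129.9, 9.0, 9.5, 141.7)

# Hypothetical weights and threshold (assumptions, for illustration only):
# only X6 (diagonal) gets weight, i.e. the cut is parallel to the other axes.
weights_axis_parallel = (0, 0, 0, 0, 0, 1)
const = 140.5

label = classify(weights_axis_parallel, x96, const)
```

Weights with several non-zero entries describe a plane that is no longer parallel to any axis, which is the situation seen in the rotated 3D view.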

Let  us  study  yet  another  technique:  the  scatterplot  matrix.  If  we  want  to  draw  all 
possible  two-dimensional  scatterplots  for  the  variables,  we  can  create  a  so-called 
draftman’s  plot  (named  after  a  draftman  who  prepares  drafts  for  parliamentary 
discussions).  Similar  to  a  draftman’s  plot  the  scatterplot  matrix  helps  in  creating 
new  ideas  and  in  building  knowledge  about  dependencies  and  structure. 

Figure  1.14  shows  a  draftman’s  plot  applied  to  the  last  four  columns  of  the  full 
bank  data  set.  For  ease  of  interpretation  we  have  distinguished  between  the  group  of 
counterfeit  and  genuine  bank  notes  by  a  different  colour.  As  discussed  several  times 
earlier, the separability of the two types of notes is different for different scatterplots. Not only is it difficult to perform this separation on, say, scatterplot X3 vs. X4, but in addition the “separation line” is no longer parallel to one of the axes. The most obvious separation happens in the scatterplot in the lower right-hand corner, which shows, as in Fig. 1.12, X5 vs. X6. The separation line here would be upward-sloping with an intercept at about X6 = 139. The upper right half of the draftman’s plot shows the density contours that we introduced in Sect. 1.3.

The  power  of  the  draftman’s  plot  lies  in  its  ability  to  show  the  internal 
connections  of  the  scatter  diagrams.  Define  a  brush  as  a  re-scalable  rectangle  that  we 
can  move  via  keyboard  or  mouse  over  the  screen.  Inside  the  brush  we  can  highlight 
or  colour  observations.  Suppose  the  technique  is  installed  in  such  a  way  that  as  we 
move  the  brush  in  one  scatter,  the  corresponding  observations  in  the  other  scatters 
are  also  highlighted.  By  moving  the  brush,  we  can  study  conditional  dependence. 

If we brush (i.e. highlight or colour the observations with the brush) the X5 vs. X6 plot and move through the upper point cloud, we see that in other plots (e.g. X3 vs. X4) the corresponding observations are more embedded in the other sub-cloud.
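
The linking behind brushing amounts to selecting observation indices in one scatter and reusing them in the others. The following Python sketch shows this idea with made-up observations (not the bank data); the function name `brush` and the rectangle bounds are our own illustrative choices.

```python
def brush(points, x_min, x_max, y_min, y_max):
    """Return the indices of the points lying inside the rectangular brush."""
    return [
        i
        for i, (x, y) in enumerate(points)
        if x_min <= x <= x_max and y_min <= y <= y_max
    ]


# Made-up toy observations with four variables each (hypothetical values).
observations = [
    (1.0, 5.0, 0.2, 7.0),
    (1.2, 5.5, 0.9, 7.5),
    (3.0, 2.0, 0.3, 9.0),
    (3.1, 2.2, 0.8, 9.5),
]

# Scatter 1 shows variables (0, 1); scatter 2 shows variables (2, 3).
scatter_1 = [(obs[0], obs[1]) for obs in observations]
scatter_2 = [(obs[2], obs[3]) for obs in observations]

# Brush the upper-left cloud in scatter 1 ...
selected = brush(scatter_1, x_min=0.5, x_max=1.5, y_min=4.5, y_max=6.0)

# ... and highlight the same observations in scatter 2 via their indices.
highlighted_in_scatter_2 = [scatter_2[i] for i in selected]
```

Because the selection is stored as indices, the same list can be used to highlight the brushed observations in every panel of the scatterplot matrix.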


Summary 

Scatterplots in two and three dimensions help in identifying separated points, outliers or sub-clusters.

Scatterplots help us in judging positive or negative dependencies.

Draftman scatterplot matrices help detect structures conditioned on values of other variables.

As the brush of a scatterplot matrix moves through a point cloud, we can study conditional dependence.


1.5 Chernoff-Flury Faces

If  we  are  given  data  in  numerical  form,  we  tend  to  also  display  it  numerically.  This 
was done in the preceding sections: an observation $x_1 = (1, 2)$ was plotted as
the  point  (1,2)  in  a  two-dimensional  coordinate  system.  In  multivariate  analysis 
we  want  to  understand  data  in  low  dimensions  (e.g.  on  a  2D  computer  screen) 
although  the  structures  are  hidden  in  high  dimensions.  The  numerical  display  of 
data  structures  using  coordinates  therefore  ends  at  dimensions  greater  than  three. 


If we are interested in condensing a structure into 2D elements, we have to consider alternative graphical techniques. The Chernoff-Flury faces, for example, provide such a condensation of high-dimensional information into a simple “face”. In fact faces are a simple way of graphically displaying high-dimensional data. The sizes of the face elements like pupils, eyes, upper and lower hair line, etc. are assigned to certain variables. The idea of using faces goes back to Chernoff (1973) and has been further developed by Bernhard Flury. We follow the design described in Flury and Riedwyl (1988) which uses the following characteristics.

1. right eye size

2.  right  pupil  size 

3.  position  of  right  pupil 

4.  right  eye  slant 

5.  horizontal  position  of  right  eye 

6.  vertical  position  of  right  eye 

7.  curvature  of  right  eyebrow 

8.  density  of  right  eyebrow 

9.  horizontal  position  of  right  eyebrow 

10.  vertical  position  of  right  eyebrow 

11.  right  upper  hair  line 

12.  right  lower  hair  line 

13.  right  face  line 

14.  darkness  of  right  hair 

15.  right  hair  slant 

16.  right  nose  line 

17.  right  size  of  mouth 

18.  right  curvature  of  mouth 
19-36.  like  1-18,  only  for  the  left  side. 

First, every variable that is to be coded into a characteristic face element is transformed into a (0, 1) scale, i.e. the minimum of the variable corresponds to 0 and the maximum to 1. The extreme positions of the face elements therefore correspond to a certain “grin” or “happy” face element. Dark hair might be coded as 1, and blond hair as 0 and so on.

As an example, consider the observations 91-110 of the bank data. Recall that the bank data set consists of 200 observations of dimension 6 where, for example, X6 is the diagonal of the note. If we assign the six variables to the following face elements

X1 = 1, 19 (eye sizes)
X2 = 2, 20 (pupil sizes)
X3 = 4, 22 (eye slants)
X4 = 11, 29 (upper hair lines)
X5 = 12, 30 (lower hair lines)
X6 = 13, 14, 31, 32 (face lines and darkness of hair),

we obtain Fig. 1.15. Also recall that observations 1-100 correspond to the genuine notes, and that observations 101-200 correspond to the counterfeit notes. The counterfeit bank notes then correspond to the upper half of Fig. 1.15. In fact the faces for these observations look more grim and less happy. The variable X6 (diagonal) already worked well in the boxplot in Fig. 1.4 in distinguishing between the counterfeit and genuine notes. Here, this variable is assigned to the face line and the darkness of the hair. That is why we clearly see a good separation within these 20 observations.

Fig. 1.15 Chernoff-Flury faces for observations 91-110 of the bank notes Q MVAfacebank10
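
The two preparatory steps (rescaling to the unit interval and assigning variables to face elements) can be sketched in Python as follows. The assignment table is copied from the text; everything else is an assumption made for illustration: the toy rows are made up, unassigned face elements are set to a neutral 0.5, and no drawing is attempted.

```python
def min_max_scale(values):
    """Rescale a variable so that its minimum maps to 0 and its maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]  # assumption: constant variable -> mid value
    return [(v - lo) / (hi - lo) for v in values]


# Variable (0-based column index) -> face elements (1..36), as in the text:
# X1 -> 1, 19; X2 -> 2, 20; X3 -> 4, 22; X4 -> 11, 29; X5 -> 12, 30;
# X6 -> 13, 14, 31, 32.
ASSIGNMENT = {
    0: (1, 19),
    1: (2, 20),
    2: (4, 22),
    3: (11, 29),
    4: (12, 30),
    5: (13, 14, 31, 32),
}

DEFAULT = 0.5  # assumption: unassigned face elements keep a neutral value


def face_parameters(data):
    """Return one dict {face element: value in [0, 1]} per observation."""
    columns = [min_max_scale(list(col)) for col in zip(*data)]
    faces = []
    for i in range(len(data)):
        face = {element: DEFAULT for element in range(1, 37)}
        for var, elements in ASSIGNMENT.items():
            for element in elements:
                face[element] = columns[var][i]
        faces.append(face)
    return faces


# Made-up toy rows with six variables each (hypothetical, not the bank data).
toy = [
    (1.0, 10.0, 3.0, 8.0, 9.0, 141.0),
    (2.0, 30.0, 1.0, 9.0, 10.0, 139.0),
    (3.0, 20.0, 2.0, 10.0, 11.0, 140.0),
]
faces = face_parameters(toy)
```

Each resulting dictionary holds the 36 parameters of one face; the observation with the largest sixth variable receives the value 1 for the face line and hair darkness elements.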


Fig. 1.16 Chernoff-Flury faces for observations 1-50 of the bank notes Q MVAfacebank50


What  happens  if  we  include  all  100  genuine  and  all  100  counterfeit  bank  notes 
in  the  Chernoff-Flury  face  technique?  Figures  1.16  and  1.17  show  the  faces  of 
the  genuine  bank  notes  with  the  same  assignments  as  used  before,  and  Figs.  1.18 
and  1.19  show  the  faces  of  the  counterfeit  bank  notes.  Comparing  Figs.  1.16 
and  1.18  one  clearly  sees  that  the  diagonal  (face  line)  is  longer  for  genuine  bank 
notes.  Equivalently  coded  is  the  hair  darkness  (diagonal)  which  is  lighter  (shorter) 
for  the  counterfeit  bank  notes.  One  sees  that  the  faces  of  the  genuine  bank  notes 
have  a  much  darker  appearance  and  have  broader  face  lines.  The  faces  in  Figs.  1.16 
and  1.17  are  obviously  different  from  the  ones  in  Figs.  1.18  and  1.19. 

Fig. 1.17 Chernoff-Flury faces for observations 51-100 of the bank notes Q MVAfacebank50

Fig. 1.18 Chernoff-Flury faces for observations 101-150 of the bank notes Q MVAfacebank50

Fig. 1.19 Chernoff-Flury faces for observations 151-200 of the bank notes Q MVAfacebank50


Summary

Faces can be used to detect sub-groups in multivariate data.

Sub-groups are characterised by similar looking faces.

Outliers are identified by extreme faces, e.g. dark hair, smile or a happy face.

If one element of X is unusual, the corresponding face element significantly changes in shape.


1.6  Andrews’  Curves 


The  basic  problem  of  graphical  displays  of  multivariate  data  is  the  dimensionality. 
Scatterplots  work  well  up  to  three  dimensions  (if  we  use  interactive  displays). 
More  than  three  dimensions  have  to  be  coded  into  displayable  2D  or  3D  structures 
(e.g.  faces).  The  idea  of  coding  and  representing  multivariate  data  by  curves  was 
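suggested by Andrews (1972). Each multivariate observation $X_i = (X_{i,1}, \ldots, X_{i,p})$ is transformed into a curve as follows:

$$
f_i(t) =
\begin{cases}
\dfrac{X_{i,1}}{\sqrt{2}} + X_{i,2}\sin(t) + X_{i,3}\cos(t) + \cdots + X_{i,p-1}\sin\left(\frac{p-1}{2}t\right) + X_{i,p}\cos\left(\frac{p-1}{2}t\right) & \text{for } p \text{ odd} \\[2ex]
\dfrac{X_{i,1}}{\sqrt{2}} + X_{i,2}\sin(t) + X_{i,3}\cos(t) + \cdots + X_{i,p}\sin\left(\frac{p}{2}t\right) & \text{for } p \text{ even}
\end{cases}
\tag{1.13}
$$

The observation represents the coefficients of a so-called Fourier series ($t \in [-\pi, \pi]$).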

Suppose that we have three-dimensional observations: $X_1 = (0, 0, 1)$, $X_2 = (1, 0, 0)$ and $X_3 = (0, 1, 0)$. Here $p = 3$ and the following representations correspond to the Andrews’ curves:

$$f_1(t) = \cos(t), \qquad f_2(t) = \frac{1}{\sqrt{2}} \qquad \text{and} \qquad f_3(t) = \sin(t).$$

These curves are indeed quite distinct, since the observations $X_1$, $X_2$ and $X_3$ are the 3D unit vectors: each observation has mass only in one of the three dimensions. The order of the variables plays an important role.

Fig. 1.20 Andrews’ curves of the observations 96-105 from the Swiss bank note data. The order of the variables is 1, 2, 3, 4, 5, 6 Q MVAandcur


Example 1.3 Let us take the 96th observation of the Swiss bank note data set,

$$X_{96} = (215.6, 129.9, 129.9, 9.0, 9.5, 141.7).$$

The Andrews’ curve is by (1.13):

$$f_{96}(t) = \frac{215.6}{\sqrt{2}} + 129.9\sin(t) + 129.9\cos(t) + 9.0\sin(2t) + 9.5\cos(2t) + 141.7\sin(3t).$$

Figure  1.20  shows  the  Andrews’  curves  for  observations  96-105  of  the  Swiss 
bank  note  data  set.  We  already  know  that  the  observations  96-100  represent  genuine 
bank  notes,  and  that  the  observations  101-105  represent  counterfeit  bank  notes.  We 
see  that  at  least  four  curves  differ  from  the  others,  but  it  is  hard  to  tell  which  curve 
belongs  to  which  group. 

We  know  from  Fig.  1.4  that  the  sixth  variable  is  an  important  one.  Therefore,  the 
Andrews’  curves  are  calculated  again  using  a  reversed  order  of  the  variables. 

Example 1.4 Let us consider again the 96th observation of the Swiss bank note data set,

$$X_{96} = (215.6, 129.9, 129.9, 9.0, 9.5, 141.7).$$

The Andrews’ curve is computed using the reversed order of variables:

$$f_{96}(t) = \frac{141.7}{\sqrt{2}} + 9.5\sin(t) + 9.0\cos(t) + 129.9\sin(2t) + 129.9\cos(2t) + 215.6\sin(3t).$$

Fig. 1.21 Andrews’ curves of the observations 96-105 from the Swiss bank note data. The order of the variables is 6, 5, 4, 3, 2, 1 Q MVAandcur2

In Fig. 1.21 the curves $f_{96}$ to $f_{105}$ for observations 96-105 are plotted. Instead of a difference in high frequency, now we have a difference in the intercept, which makes it more difficult for us to see the differences in observations.
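
The transformation (1.13) and the two orderings of Examples 1.3 and 1.4 can be reproduced with a short Python sketch. The function name `andrews_curve` and the evaluation grid are our own choices; only the observation $X_{96}$ is taken from the text.

```python
import math


def andrews_curve(x, t):
    """Evaluate the Andrews' curve (1.13) of observation x at t in [-pi, pi]."""
    value = x[0] / math.sqrt(2)
    for k, coefficient in enumerate(x[1:], start=1):
        frequency = (k + 1) // 2  # 1, 1, 2, 2, 3, 3, ...
        if k % 2 == 1:
            # X_2, X_4, ... are the coefficients of the sine terms
            value += coefficient * math.sin(frequency * t)
        else:
            # X_3, X_5, ... are the coefficients of the cosine terms
            value += coefficient * math.cos(frequency * t)
    return value


# 96th observation of the Swiss bank note data (Example 1.3) ...
x96 = (215.6, 129.9, 129.9, 9.0, 9.5, 141.7)
# ... and the same observation with reversed variable order (Example 1.4).
x96_reversed = tuple(reversed(x96))

# Evaluate both curves on 201 equally spaced points of [-pi, pi].
ts = [-math.pi + i * (2 * math.pi / 200) for i in range(201)]
curve = [andrews_curve(x96, t) for t in ts]
curve_reversed = [andrews_curve(x96_reversed, t) for t in ts]
```

In the original order the diagonal (141.7) multiplies the highest frequency term sin(3t); in the reversed order it enters only through the constant term 141.7/√2, which is why differences in this variable then show up as a shift in level.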

This  shows  that  the  order  of  the  variables  plays  an  important  role  in  the 
interpretation.  If  X  is  high-dimensional,  then  the  last  variables  will  only  have  a 
small  visible  contribution  to  the  curve:  they  fall  into  the  high  frequency  part  of 
the  curve.  To  overcome  this  problem  Andrews  suggested  using  an  order  which 
is  suggested  by  Principal  Component  Analysis.  This  technique  will  be  treated  in 
detail in Chap. 11. In fact, the sixth variable will appear there as the most important variable for discriminating between the two groups. If the number of observations is more than 20, there may be too many curves in one graph. This will result in overplotting of curves or a bad “signal-to-ink ratio”, see Tufte (1983). It is
therefore  advisable  to  present  multivariate  observations  via  Andrews’  curves  only 
for  a  limited  number  of  observations. 

Summary

Outliers appear as single Andrews’ curves that look different from the rest.

A sub-group of data is characterised by a set of similar curves.

The order of the variables plays an important role for interpretation.

The order of variables may be optimised by Principal Component Analysis.

For more than 20 observations we may obtain a bad “signal-to-ink ratio”, i.e. too many curves are overlaid in one picture.


1.7  Parallel  Coordinates  Plots 

A parallel coordinates plot (PCP) is a method for representing high-dimensional data, see Inselberg (1985).
Instead  of  plotting  observations  in  an  orthogonal  coordinate  system,  PCP  draws 
coordinates  in  parallel  axes  and  connects  them  with  straight  lines.  This  method  helps 
in  representing  data  with  more  than  four  dimensions. 

One first scales all variables to max = 1 and min = 0. The coordinate index $j$ is drawn onto the horizontal axis, and the scaled value of variable $X_j$ is mapped onto the vertical axis. This way of representation is very useful for high-dimensional data. It is however also sensitive to the order of the variables, since certain trends in the data can be shown more clearly in one ordering than in another.

Example  1.5  Take,  once  again,  the  observations  96-105  of  the  Swiss  bank  notes. 
These  observations  are  six  dimensional,  so  we  can’t  show  them  in  a  six-dimensional 
Cartesian  coordinate  system.  Using  the  PCP  technique,  however,  they  can  be  plotted 
on  parallel  axes.  This  is  shown  in  Fig.  1.22. 

PCP can also be used for detecting linear dependencies between variables: if, for two variables ($p = 2$), all the lines are almost parallel, there is a positive linear dependence between them. In Fig. 1.23 we display the two variables weight and displacement for the car data set in Sect. 22.3. The correlation coefficient $\rho$ introduced in Sect. 3.2 is 0.9. If all lines intersect visibly in the middle, there is evidence of a negative linear dependence between these two variables, see Fig. 1.24. In fact the correlation is $\rho = -0.82$ between the two variables mileage and weight: the more the weight, the less the mileage.
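
The visual rule (parallel lines versus crossing lines) can be checked numerically. In the Python sketch below, two PCP lines between neighbouring axes cross exactly when the two observations are ordered differently on the two variables, so we count such pairs and compare with the correlation coefficient. The crossing count is our own illustration, not a method from the text, and the data are made up (they are not the car data).

```python
import math
from itertools import combinations


def pearson(x, y):
    """Empirical correlation coefficient of two variables."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def crossing_fraction(x, y):
    """Fraction of pairs of PCP lines that cross between the two axes.

    The line of observation i runs from (axis 1, x[i]) to (axis 2, y[i]).
    Two lines cross if and only if their order on the two axes is reversed.
    Min-max scaling keeps the order, so unscaled values can be used.
    """
    pairs = list(combinations(range(len(x)), 2))
    crossings = sum(1 for i, k in pairs if (x[i] - x[k]) * (y[i] - y[k]) < 0)
    return crossings / len(pairs)


# Made-up toy variables (hypothetical values).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pos = [2.1, 3.9, 6.2, 8.0, 9.8]   # increases with x
y_neg = [9.5, 8.1, 6.0, 4.2, 1.9]   # decreases with x

rho_pos, cross_pos = pearson(x, y_pos), crossing_fraction(x, y_pos)
rho_neg, cross_neg = pearson(x, y_neg), crossing_fraction(x, y_neg)
```

For the increasing pair no lines cross and the correlation is close to one; for the decreasing pair every pair of lines crosses and the correlation is close to minus one.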


Fig. 1.22 Parallel coordinates plot of observations 96-105 Q MVAparcoo1

Fig. 1.23 Parallel coordinates plot indicating strong positive dependence with $\rho = 0.9$, X1 = weight, X2 = displacement Q MVApcp2

Another use of PCP is sub-group detection. Lines converging to different discrete points indicate sub-groups. Figure 1.25 shows the last three variables of the car data (displacement, gear ratio for high gear and company’s headquarters); we see convergence to the last variable. This last variable is the company’s headquarters with three discrete values: USA, Japan and Europe. PCP can also be used for outlier detection. Figure 1.26 shows the variables headroom, rear seat clearance and trunk (boot) space in the car data set. There are two outliers visible. The boxplot Fig. 1.27 confirms this.

Fig. 1.24 Parallel coordinates plot showing strong negative dependence with $\rho = -0.82$, X1 = mileage, X2 = weight Q MVApcp3

Fig. 1.25 Parallel coordinates plot with sub-groups Q MVApcp4

PCPs also have possible shortcomings: we cannot distinguish observations when two lines cross at one point unless we distinguish them clearly (e.g. by different line style). In Fig. 1.28, observations A and B both have the same value at j = 2. Two lines cross at one point here. At the 3rd and 4th dimension we cannot tell which line belongs to which observation. A dotted line for A and solid line for B could have helped there.

To solve this problem one uses an interpolation curve instead of straight lines, e.g. cubic curves as in Graham and Kennedy (2003). Figure 1.29 is a variant of Fig. 1.28. In Fig. 1.29, with a natural cubic spline, it is evident how to follow the curves and distinguish the observations. The real power of PCP, though, comes through colouring sub-groups.
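
Why a spline helps at the crossing point can be seen from the slopes at j = 2. The Python sketch below fits a natural cubic spline through the points A = [0, 2, 3, 2] and B = [3, 2, 2, 1] of Fig. 1.28, assuming the axes sit at j = 1, 2, 3, 4. It is a hand-written illustration, not the code of the quantlet, and the function names are ours.

```python
def natural_spline_second_derivatives(y):
    """Second derivatives M_0..M_n of the natural cubic spline through
    (1, y[0]), (2, y[1]), ... (unit spacing), with M_0 = M_n = 0."""
    n = len(y) - 1
    if n < 2:
        return [0.0] * len(y)
    # Interior equations: M[i-1] + 4 M[i] + M[i+1] = 6 (y[i+1] - 2 y[i] + y[i-1])
    rhs = [6.0 * (y[i + 1] - 2.0 * y[i] + y[i - 1]) for i in range(1, n)]
    m = len(rhs)
    # Thomas algorithm for the tridiagonal system (diagonal 4, off-diagonals 1).
    c_prime = [0.0] * m
    d_prime = [0.0] * m
    c_prime[0] = 1.0 / 4.0
    d_prime[0] = rhs[0] / 4.0
    for i in range(1, m):
        denom = 4.0 - c_prime[i - 1]
        c_prime[i] = 1.0 / denom
        d_prime[i] = (rhs[i] - d_prime[i - 1]) / denom
    interior = [0.0] * m
    interior[-1] = d_prime[-1]
    for i in range(m - 2, -1, -1):
        interior[i] = d_prime[i] - c_prime[i] * interior[i + 1]
    return [0.0] + interior + [0.0]


def spline_value(y, M, x):
    """Evaluate the spline at x in [1, len(y)]; knot i sits at position i + 1."""
    i = min(max(int(x) - 1, 0), len(y) - 2)  # index of the segment containing x
    left, right = i + 1, i + 2               # knot positions of that segment
    a, b = right - x, x - left
    return (
        (M[i] * a ** 3 + M[i + 1] * b ** 3) / 6.0
        + (y[i] - M[i] / 6.0) * a
        + (y[i + 1] - M[i + 1] / 6.0) * b
    )


def spline_slope_at_knot(y, M, i):
    """Slope of the spline at knot i (0-based; i must not be the last knot)."""
    return (y[i + 1] - y[i]) - (2.0 * M[i] + M[i + 1]) / 6.0


# Data points of Fig. 1.28.
A = [0, 2, 3, 2]
B = [3, 2, 2, 1]
M_A = natural_spline_second_derivatives(A)
M_B = natural_spline_second_derivatives(B)

# Both curves pass through the value 2 at j = 2, but with different slopes.
slope_A = spline_slope_at_knot(A, M_A, 1)
slope_B = spline_slope_at_knot(B, M_B, 1)
```

The curve of A passes through the common point rising, the curve of B passes through it falling, and both do so without a kink. The eye can therefore follow each curve through the intersection, which is not possible with the polygon lines.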


Fig. 1.26 PCP for X1 = headroom, X2 = rear seat clearance and X3 = trunk space Q MVApcp5

Fig. 1.27 Boxplots for headroom, rear seat clearance and trunk space Q MVApcp6

Example 1.6 Data in Fig. 1.30 are coloured according to X13, the car company’s headquarters. Red stands for European cars, green for Japanese and black for US. This PCP with colouring can provide some information for us:

1. US cars (black) tend to have large values in X7, X8, X9, X10, X11 (trunk (boot) space, weight, length, turning diameter, displacement), which means US cars are generally larger.

2. Japanese cars (green) have large values in X3, X4 (both for repair record), which means Japanese cars tend to be repaired less.

Fig. 1.28 PCP with intersection for given data points A = [0, 2, 3, 2] and B = [3, 2, 2, 1] Q MVApcp7

Fig. 1.29 PCP with cubic spline interpolation Q MVApcp8

Fig. 1.30 Parallel coordinates plot for car data Q MVApcp1


Summary 

Parallel coordinates plots overcome the visualisation problem of the Cartesian coordinate system for dimensions greater than 4.

Outliers are visible as outlying polygon curves.

The order of variables is important, especially in the detection of sub-groups.

Sub-groups may be screened by selective colouring.


1.8  Hexagon  Plots 

This section closely follows the presentation of Lewin-Koh (2006). In geometry, a hexagon is a polygon with six edges and six vertices. Hexagon binning is a type of bivariate histogram with hexagon borders. It is useful for visualising the structure of data sets entailing a large number of observations n. The concept of hexagon binning is as follows:

1.  The  xy  plane  over  the  set  (range(x),  range(y))  is  tessellated  by  a  regular  grid  of 
hexagons. 

2.  The  number  of  points  falling  in  each  hexagon  is  counted. 

3.  The  hexagons  with  count  >  0  are  plotted  by  using  a  colour  ramp  or  varying  the 
radius  of  the  hexagon  in  proportion  to  the  counts. 

This algorithm is extremely fast and effective for displaying the structure of data sets even for $n > 10^6$. If the size of the grid and the cuts in the colour ramp are chosen in a clever fashion, then the structure inherent in the data should emerge in the binned plot. The same caveats apply to hexagon binning as to histograms. Variance and bias vary in opposite directions with bin width, so we have to settle for finding the value of the bin width that yields the optimal compromise between variance and bias reduction. Clearly, if we increase the size of the grid, the hexagon plot appears to be smoother, but without some reasonable criterion on hand it remains difficult to say which bin width provides the “optimal” degree of smoothness. The default number of bins suggested by standard software is 30.
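
The counting step of hexagon binning can be sketched in Python as follows. We assume one common construction: the hexagon centres form two shifted rectangular lattices, and each point is assigned to the nearest centre. The function names, the spacing parameter `dx` and the toy points are ours; standard software may organise the grid differently.

```python
import math
from collections import Counter


def hexbin_counts(xs, ys, dx):
    """Count points per hexagon; centres are dx apart horizontally.

    Keys are (lattice, i, j): lattice 0 has centres (i*dx, j*dy),
    lattice 1 has centres ((i + 0.5)*dx, (j + 0.5)*dy), dy = dx*sqrt(3).
    """
    dy = dx * math.sqrt(3)
    counts = Counter()
    for x, y in zip(xs, ys):
        # Candidate 1: nearest centre on the unshifted lattice.
        i1, j1 = round(x / dx), round(y / dy)
        d1 = (x - i1 * dx) ** 2 + (y - j1 * dy) ** 2
        # Candidate 2: nearest centre on the lattice shifted by (dx/2, dy/2).
        i2, j2 = math.floor(x / dx), math.floor(y / dy)
        d2 = (x - (i2 + 0.5) * dx) ** 2 + (y - (j2 + 0.5) * dy) ** 2
        # The closer of the two candidates is the hexagon containing the point.
        key = (0, i1, j1) if d1 <= d2 else (1, i2, j2)
        counts[key] += 1
    return counts


def hexagon_centre(key, dx):
    """Centre coordinates of the hexagon with the given key."""
    lattice, i, j = key
    shift = 0.5 if lattice == 1 else 0.0
    return ((i + shift) * dx, (j + shift) * dx * math.sqrt(3))


# Made-up toy points (hypothetical values, not the ALLBUS data).
xs = [0.0, 0.1, -0.1, 0.5, 0.55, 2.0]
ys = [0.0, 0.05, 0.1, 0.85, 0.9, 0.0]
counts = hexbin_counts(xs, ys, dx=1.0)
```

Only hexagons with a positive count appear in `counts`, so the plotting step can loop over its items and map each count to a colour or to a hexagon radius.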

Applications  to  some  data  sets  are  shown  as  follows.  The  data  is  taken  from 
ALLBUS  (2006)[ZA  No. 3762].  The  number  of  respondents  is  2,946.  The  following 
nine  variables  have  been  selected  to  analyse  the  relation  between  each  pair  of 
variables. 


X1  Age
X2  Net income
X3  Time for television per day in minutes
X4  Time for work per week in hours
X5  Time for computer per week in hours
X6  Days for illness yearly
X7  Living space (square metres)
X8  Size
X9  Weight

Firstly, we consider two variables X1 = Age and X2 = Net income in Fig. 1.31.
The  top  left  picture  is  a  scatter  plot.  The  second  one  is  a  hexagon  plot  with  borders 
making  it  easier  to  see  the  separation  between  hexagons.  Looking  at  these  plots  one 
can  see  that  almost  all  individuals  have  a  net  monthly  income  of  less  than  2,000 
EUR.  Only  two  individuals  earn  more  than  10,000  EUR  per  month. 

Figure 1.32 shows the relation between X1 and X5. About 40 % of respondents from 20 to 80 years old do not use a computer at least once per week. The respondent who deals with a computer 105 h each week was actually not in full-time employment.

Fig. 1.31 Hexagon plots between X1 and X2 Q MVAageIncome

Fig. 1.32 Hexagon plot between X1 and X5 Q MVAageCom


Clearly,  people  who  earn  modest  incomes  live  in  smaller  flats.  The  trend  here 
is  relatively  clear  in  Fig.  1.33.  The  larger  the  net  income,  the  larger  the  flat.  A  few 
people  do  however  earn  high  incomes  but  live  in  small  flats. 

Fig. 1.33 Hexagon plot between X2 and X7 Q MVAincomeLi


Summary


Hexagon binning is a type of bivariate histogram, used for visualising large data.

Variance and bias vary in opposite directions with bin width.

Hexagons have the property of “symmetry of the nearest neighbours”, which square bins lack.

Hexagons are visually less biased for displaying densities than other regular tessellations.


1.9  Boston  Housing 

Aim  of  the  Analysis 

The  Boston  Housing  data  set  was  analysed  by  Harrison  and  Rubinfeld  (1978)  who 
wanted  to  find  out  whether  “clean  air”  had  an  influence  on  house  prices.  We  will 
use  this  data  set  in  this  chapter  and  in  most  of  the  following  chapters  to  illustrate  the 
presented  methodology.  The  data  are  described  in  Sect.  22.1. 

Fig. 1.34 Parallel coordinates plot for Boston housing data Q MVApcphousing


What Can Be Seen from the PCPs

In order to highlight the relations of X14 to the remaining 13 variables, we colour all of the observations with X14 > median(X14) as red lines in Fig. 1.34. Some of the variables seem to be strongly related. The most obvious relation is the negative dependence between X13 and X14. It can also be argued that a strong dependence exists between X12 and X14 since no red lines are drawn in the lower part of X12. The opposite can be said about X11: there are only red lines plotted in the lower part of this variable. Low values of X11 induce high values of X14.

For  the  PCP,  the  variables  have  been  rescaled  over  the  interval  [0, 1]  for  better 
graphical  representations.  The  PCP  shows  that  the  variables  are  not  distributed  in 
a symmetric manner. It can be clearly seen that the values of X1 and X9 are much
more  concentrated  around  0.  Therefore  it  makes  sense  to  consider  transformations 
of  the  original  data. 


The  Scatterplot  Matrix 

One characteristic of PCPs is that many lines are drawn on top of each other. This problem is reduced by depicting the variables in pairs of scatterplots. Including all 14 variables in one large scatterplot matrix is possible, but makes it hard to see anything from the plots. Therefore, for illustrative purposes we will analyse only one such matrix from a subset of the variables in Fig. 1.35. On the basis of the PCP and the scatterplot matrix we would like to interpret each of the 13 variables and their eventual relation to the 14th variable. Included in the figure are images for X1-X5 and X14, although each variable is discussed in detail below. All references made to scatterplots in the following refer to Fig. 1.35.

Fig. 1.35 Scatterplot matrix for variables X1, ..., X5 and X14 of the Boston housing data Q MVAdrafthousing

Fig. 1.36 Scatterplot matrix for variables X1, ..., X5 and X14 of the Boston housing data Q


Per-Capita Crime Rate X1

Taking the logarithm makes the variable’s distribution more symmetric. This can be seen in the boxplot of $\tilde{X}_1$ in Fig. 1.37 which shows that the median and the mean have moved closer to each other than they were for the original X1. Plotting the KDE of $\tilde{X}_1 = \log(X_1)$ would reveal that two sub-groups might exist with different mean values. However, taking a look at the scatterplots in Fig. 1.36 of the logarithms which include X1 does not clearly reveal such groups. Given that the scatterplot of log(X1) vs. log(X14) shows a relatively strong negative relation, it might be the case that the two sub-groups of X1 correspond to houses with two different price levels. This is confirmed by the two boxplots shown to the right of the X1 vs. X2 scatterplot (in Fig. 1.35): the right boxplot’s shape differs a lot from the black one’s, having a much higher median and mean.
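
The effect of the logarithm on the distance between mean and median can be illustrated with a small Python sketch. The values are made up and strongly right-skewed (they are not the Boston crime rates), and measuring the gap relative to the sample range is our own choice of yardstick.

```python
import math
import statistics


def mean_median_gap(values):
    """Absolute gap between mean and median, relative to the sample range."""
    spread = max(values) - min(values)
    return abs(statistics.mean(values) - statistics.median(values)) / spread


# Made-up, right-skewed toy values (hypothetical).
toy = [0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0, 30.0, 100.0]
toy_log = [math.log(v) for v in toy]

gap_raw = mean_median_gap(toy)      # mean pulled far above the median
gap_log = mean_median_gap(toy_log)  # mean and median nearly coincide
```

For these values the relative gap shrinks considerably after taking logarithms, which is the kind of change the boxplot comparison refers to.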


Proportion  of  Residential  Area  Zoned  for  Large  Lots  X2 

It strikes the eye in Fig. 1.35 that there is a large cluster of observations for which X2 is equal to 0. It also strikes the eye that, as the scatterplot of X1 vs. X2 shows, there is a strong, though non-linear, negative relation between X1 and X2: almost all observations for which X2 is high have an X1-value close to zero, and vice versa, many observations for which X2 is zero have quite a high per-capita crime rate X1. This could be due to the location of the areas, e.g. urban districts might have a higher crime rate and at the same time it is unlikely that any residential land would be zoned in a generous manner.

As  far  as  the  house  prices  are  concerned,  it  can  be  said  that  there  seems  to 
be  no  clear  (linear)  relation  between  X2  and  X14,  but  it  is  obvious  that  the  more 
expensive  houses  are  situated  in  areas  where  X2  is  large  (this  can  be  seen  from  the 
two  boxplots  on  the  second  position  of  the  diagonal,  where  the  red  one  has  a  clearly 
higher  mean/median  than  the  black  one). 


Proportion  of  Non-retail  Business  Acres  X3 

The PCP (in Fig. 1.34) as well as the scatterplot of X3 vs. X14 shows an obvious
negative relation between X3 and X14. The relationship between the logarithms of
both variables seems to be almost linear. This negative relation might be explained
by the fact that non-retail business sometimes causes annoying sounds and other
pollution. Therefore, it seems reasonable to use X3 as an explanatory variable for
the prediction of X14 in a linear-regression analysis.

As far as the distribution of X3 is concerned, it can be said that the KDE of X3
clearly has two peaks, which indicates that there are two sub-groups. According to
the negative relation between X3 and X14 it could be the case that one sub-group
corresponds to the more expensive houses and the other one to the cheaper houses.


Charles  River  Dummy  Variable  X4 

The observation made from the PCP that there are more expensive houses than
cheap houses situated on the banks of the Charles River is confirmed by inspecting
the scatterplot matrix. Still, we might have some doubt that proximity to the river
influences house prices. Looking at the original data set, it becomes clear that the
observations for which X4 equals one are districts that are close to each other.
Apparently, the Charles River does not flow through very many different districts.
Thus, it may be pure coincidence that the more expensive districts are close to the
Charles River; their high values might be caused by many other factors such as the
pupil/teacher ratio or the proportion of non-retail business acres.


Nitric  Oxides  Concentration  X5 

The scatterplot of X5 vs. X14 and the separate boxplots of X5 for more and less
expensive houses reveal a clear negative relation between the two variables. As it
was the main aim of the authors of the original study to determine whether pollution
had an influence on housing prices, it should be considered very carefully whether
X5 can serve as an explanatory variable for price X14. A possible reason against it
being  an  explanatory  variable  is  that  people  might  not  like  to  live  in  areas  where  the 
emissions  of  nitric  oxides  are  high.  Nitric  oxides  are  emitted  mainly  by  automobiles, 
by  factories  and  from  heating  private  homes.  However,  as  one  can  imagine  there  are 
many  good  reasons  besides  nitric  oxides  not  to  live  in  urban  or  industrial  areas. 
Noise  pollution,  for  example,  might  be  a  much  better  explanatory  variable  for  the 
price  of  housing  units.  As  the  emission  of  nitric  oxides  is  usually  accompanied  by 
noise pollution, using X5 as an explanatory variable for X14 might lead to the false
conclusion  that  people  run  away  from  nitric  oxides,  whereas  in  reality  it  is  noise 
pollution  that  they  are  trying  to  escape. 


Average Number of Rooms per Dwelling X6

The number of rooms per dwelling is a possible measure of the size of the houses.
Thus we expect X6 to be strongly correlated with X14 (the houses' median price).
Indeed, apart from some outliers, the scatterplot of X6 vs. X14 shows a point cloud
which is clearly upward-sloping and which seems to be a realisation of a linear
dependence of X14 on X6. The two boxplots of X6 confirm this notion by showing
that the quartiles, the mean and the median are all much higher for the red than for
the black boxplot.


Proportion of Owner-Occupied Units Built Prior to 1940 X7

There is no clear connection visible between X7 and X14. There could be a weak
negative correlation between the two variables, since the (red) boxplot of X7 for the
districts whose price is above the median price indicates a lower mean and median
than the (black) boxplot for the districts whose price is below the median price. The
fact that the correlation is not so clear could be explained by two opposing effects.
On the one hand, house prices should decrease if the older houses are not in
good shape. On the other hand, prices could increase, because people often like older
houses better than newer houses, preferring their atmosphere of space and tradition.
Nevertheless, it seems reasonable that the age of the houses has an influence on their
price X14.

Raising X7 to the power of 2.5 reveals again that the data set might consist of
two sub-groups. But in this case it is not obvious that the sub-groups correspond
to more expensive or cheaper houses. One can furthermore observe a negative
relation between X7 and X8. This could reflect the way the Boston metropolitan
area developed over time; the districts with the newer buildings are further away
from employment centers and industrial facilities.


Weighted Distance to Five Boston Employment Centers X8

Since  most  people  like  to  live  close  to  their  place  of  work,  we  expect  a  negative 
relation  between  the  distances  to  the  employment  centers  and  house  prices.  The 
scatterplot hardly reveals any dependence, but the boxplots of X8 indicate that there
might be a slightly positive relation as the red boxplot's median and mean are higher
than the black one's. Again, there might be two effects in opposite directions at work
here.  The  first  is  that  living  too  close  to  an  employment  centre  might  not  provide 
enough  shelter  from  the  pollution  created  there.  The  second,  as  mentioned  above,  is 
that  people  do  not  travel  very  far  to  their  workplace. 


Index of Accessibility to Radial Highways X9

The first obvious thing one can observe from the scatterplots, as well as in the
histograms and the KDEs, is that there are two sub-groups of districts containing X9
values which are close to the respective group's mean. The scatterplots deliver no
hint as to what might explain the occurrence of these two sub-groups. The boxplots
indicate that for the cheaper and for the more expensive houses the average of X9 is
almost the same.


Full-Value Property Tax X10

X10 shows behaviour similar to that of X9: two sub-groups exist. A downward-sloping
curve seems to underlie the relation of X10 and X14. This is confirmed by
the two boxplots drawn for X10: the red one has a lower mean and median than the
black one.


Pupil/Teacher Ratio X11

The red and black boxplots of X11 indicate a negative relation between X11 and X14.
This is confirmed by inspection of the scatterplot of X11 vs. X14: the point cloud is
downward sloping, i.e. the fewer teachers there are per pupil, the less people pay on
median for their dwellings.


Proportion of African-American B, $X_{12} = 1000(B - 0.63)^2 I(B < 0.63)$

Interestingly, X12 is negatively, though not linearly, correlated with X3, X7 and
X11, whereas it is positively related with X14. Looking at the data set reveals that for
almost all districts X12 takes on a value around 390. Since B cannot be larger than
0.63, such values can only be caused by B close to zero. Therefore, the higher X12
is, the lower the actual proportion of African-Americans is. Among observations
405-470 there are quite a few that have an X12 that is much lower than 390. This
means that in these districts the proportion of African-Americans is above zero.
We can observe two clusters of points in the scatterplots of log(X12): one cluster
for which X12 is close to 390 and a second one for which X12 is between 3 and
100. When X12 is positively related with another variable, the actual proportion of
African-Americans is negatively correlated with this variable and vice versa. This
means that African-Americans live in areas where there is a high proportion of
non-retail business land, where there are older houses and where there is a high (i.e. bad)
pupil/teacher ratio. It can be observed that districts with housing prices above the
median can only be found where the proportion of African-Americans is virtually
zero.


Proportion of Lower Status of the Population X13

Of all the variables, X13 exhibits the clearest negative relation with X14; hardly any
outliers show up. Taking the square root of X13 and the logarithm of X14 transforms
the relation into a linear one.


Transformations 

Since  most  of  the  variables  exhibit  an  asymmetry  with  a  higher  density  on  the  left- 
hand  side,  the  following  transformations  are  proposed: 

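$\widetilde{X}_1 = \log(X_1)$

$\widetilde{X}_2 = X_2 / 10$

$\widetilde{X}_3 = \log(X_3)$

$\widetilde{X}_4$: none, since $X_4$ is binary

$\widetilde{X}_5 = \log(X_5)$

$\widetilde{X}_6 = \log(X_6)$

$\widetilde{X}_7 = X_7^{2.5} / 10000$

$\widetilde{X}_8 = \log(X_8)$

$\widetilde{X}_9 = \log(X_9)$

$\widetilde{X}_{10} = \log(X_{10})$

$\widetilde{X}_{11} = \exp(0.4 \times X_{11}) / 1000$

$\widetilde{X}_{12} = X_{12} / 100$

$\widetilde{X}_{13} = \sqrt{X_{13}}$

$\widetilde{X}_{14} = \log(X_{14})$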

Taking  the  logarithm  or  raising  the  variables  to  the  power  of  something  smaller 
than  one  helps  to  reduce  the  asymmetry.  This  is  due  to  the  fact  that  lower  values 
move  further  away  from  each  other,  whereas  the  distance  between  greater  values  is 
reduced  by  these  transformations. 
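
The effect of such transformations on asymmetry can be illustrated with a small sketch. The sample below is hypothetical (it is not taken from the Boston housing data), and the moment-based skewness coefficient is only one of several possible asymmetry measures.

```python
import math

def skewness(xs):
    """Moment-based skewness: third central moment over variance^(3/2)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed sample: many small values, a few large ones.
x = [0.1, 0.2, 0.3, 0.5, 0.8, 1.5, 3.0, 9.0, 25.0, 70.0]

skew_raw = skewness(x)
skew_sqrt = skewness([math.sqrt(v) for v in x])  # power smaller than one
skew_log = skewness([math.log(v) for v in x])    # logarithm

print(skew_raw, skew_sqrt, skew_log)
```

For this sample the skewness shrinks from the raw data to the square root and shrinks further for the logarithm, in line with the argument above.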

Figure 1.37 displays boxplots for the original mean-variance-scaled variables as
well as for the proposed transformed variables. The transformed variables' boxplots
are more symmetric and have fewer outliers than the original variables' boxplots.


Fig. 1.37 Boxplots for all of the variables from the Boston housing data before and after the proposed transformations Q MVAboxbhd


1.10  Exercises 

Exercise  1.1  Is  the  upper  extreme  always  an  outlier? 

Exercise  1.2  Is  it  possible  for  the  mean  or  the  median  to  lie  outside  of  the  fourths 
or  even  outside  of  the  outside  bars? 

Exercise 1.3 Assume that the data are normally distributed N(0, 1). What percentage
of the data do you expect to lie outside the outside bars?

Exercise 1.4 What percentage of the data do you expect to lie outside the outside
bars if we assume that the data are normally distributed $N(0, \sigma^2)$ with unknown
variance $\sigma^2$?

Exercise 1.5 How would the five-number summary of the 15 largest US cities differ
from that of the 50 largest US cities? How would the five-number summary of 15
observations of N(0, 1)-distributed data differ from that of 50 observations from the
same distribution?

Exercise 1.6 Is it possible that all five numbers of the five-number summary could
be equal? If so, under what conditions?

Exercise 1.7 Suppose we have 50 observations of X ~ N(0, 1) and another 50
observations of Y ~ N(2, 1). What would the 100 Flury faces look like if you had
defined as face elements the face line and the darkness of hair? Do you expect any
similar faces? How many faces do you think should look like observations of Y even
though they are X observations?

Exercise 1.8 Draw a histogram for the mileage variable of the car data
(Sect. 22.3). Do the same for the three groups (USA, Japan, and Europe). Do
you obtain a similar conclusion as in the parallel boxplot in Fig. 1.3 for these data?

Exercise  1.9  Use  some  bandwidth  selection  criterion  to  calculate  the  optimally 
chosen  bandwidth  h  for  the  diagonal  variable  of  the  bank  notes.  Would  it  be  better 
to  have  one  bandwidth  for  the  two  groups? 

Exercise 1.10 In Fig. 1.9 the densities overlap in the region of diagonal ≈ 140.4.
We partially observed this in the boxplot of Fig. 1.4. Our aim is to separate the two
groups. Will we be able to do this effectively on the basis of this diagonal variable
alone?

Exercise  1.11  Draw  a  parallel  coordinates  plot  for  the  car  data. 

Exercise 1.12 How would you identify discrete variables (variables with only a
limited number of possible outcomes) on a parallel coordinates plot?

Exercise 1.13 True or false: the heights of the bars of a histogram are equal to the
relative frequencies with which observations fall into the respective bins.

Exercise 1.14 True or false: kernel density estimates must always take on a value
between 0 and 1. (Hint: Which quantity connected with the density function has to
be equal to 1? Does this property imply that the density function has to always be
less than 1?)

Exercise  1.15  Let  the  following  data  set  represent  the  heights  of  13  students  taking 
the  Applied  Multivariate  Statistical  Analysis  course: 


1.72, 1.83, 1.74, 1.79, 1.94, 1.81, 1.66, 1.60, 1.78, 1.77, 1.85, 1.70, 1.76. 


1.  Find  the  corresponding  five-number  summary. 

2.  Construct  the  boxplot. 

3.  Draw  a  histogram  for  this  data  set. 

Exercise  1.16  Describe  the  unemployment  data  (see  Table  22.19)  that  contain 
unemployment  rates  of  all  German  Federal  States  using  various  descriptive  tech¬ 
niques. 

Exercise  1.17  Using  yearly  population  data  (see  Sect.  22.20),  generate 

1. a boxplot (choose one of the variables)

2. an Andrews' curve (choose ten data points)

3. a scatterplot

4. a histogram (choose one of the variables)

What do these graphs tell you about the data and their structure?

Exercise 1.18 Make a draftman plot for the car data with the variables

X1 = price,

X2 = mileage,

X8 = weight,

X9 = length.

Move the brush into the region of heavy cars. What can you say about price, mileage
and length? Move the brush onto high fuel economy. Mark the Japanese, European
and American cars. You should find the same condition as in boxplot Fig. 1.3.

Exercise  1.19  What  is  the  form  of  a  scatterplot  of  two  independent  random 
variables  X\  and  X2  with  standard  normal  distribution? 

Exercise 1.20 Rotate a three-dimensional standard normal point cloud in 3D
space. Does it "almost look the same from all sides"? Can you explain why or
why not?

Exercise 1.21 There are many reasons for using hexagons to visualise the structure
of data.

1. Hexagons have the property of "symmetry of nearest neighbours", which is lacking in
square bins.


Fig. 1.38 Hexagon binning algorithm Q MVAhexaAl


2.  Hexagons  have  the  maximum  number  of  sides  that  a  polygon  can  have  for  a 
regular  tessellation  of  the  plane. 

3.  Hexagons  are  visually  less  biased  for  displaying  densities  than  other  regular 
tessellations. 

The hexagon binning algorithm is as follows:

1. Decrease the y-axis variable by a factor of $\sqrt{3}$ (this makes the calculation quicker)

2. Create a dual lattice (circle and star lines in Fig. 1.38)

3. Bin each point into a pair of near neighbour rectangles

4. Choose the closest of the rectangle centers (adjusting for $\sqrt{3}$)

The rectangles created from the dual lattice have length $h_x$ (bin width of hexagons) and
height $h_y = \sqrt{3}\, h_x$. From these rectangles we can get hexagons with bin width $h_x$.
The first point of the star lattice has coordinates $x_0$ and $y_0$. The other star points
have coordinates $x_0 + k_1 h_x$ and $y_0 + l_1 h_y$, where $k_1, l_1 = 1, 2, \dots$ The first
point of the circle lattice has coordinates $x_0 + h_x/2$ and $y_0 + h_y/2$. Other circle points
are calculated like star points. Suppose an arbitrary point with coordinates $x, y$ lies
in the intersection of two near neighbour rectangles. What is the distance from this
point to one of the two corners?
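
The binning steps above can be sketched in a few lines. This is a minimal illustration and not the MVAhexaAl quantlet: it assumes the circle lattice is the star lattice shifted by half a rectangle in both directions, works in the original coordinates instead of rescaling y, and allows arbitrary integer lattice indices. The function name `hexbin_center` and its default arguments are hypothetical.

```python
import math

def hexbin_center(x, y, x0=0.0, y0=0.0, hx=1.0):
    """Return the centre of the hexagonal bin that contains the point (x, y)."""
    hy = math.sqrt(3.0) * hx  # rectangle height belonging to bin width hx

    def nearest(ox, oy):
        # Nearest point of the rectangular lattice with origin (ox, oy).
        k = math.floor((x - ox) / hx + 0.5)
        l = math.floor((y - oy) / hy + 0.5)
        return (ox + k * hx, oy + l * hy)

    def dist2(c):
        return (x - c[0]) ** 2 + (y - c[1]) ** 2

    star = nearest(x0, y0)                            # star lattice
    circle = nearest(x0 + hx / 2.0, y0 + hy / 2.0)    # circle lattice
    # The hexagon centre is the closer of the two candidate centres.
    return star if dist2(star) <= dist2(circle) else circle

print(hexbin_center(0.1, 0.1))
print(hexbin_center(0.5, 0.8))
```

Each point is compared with only two candidate centres, one from each lattice, which is what makes the dual-lattice construction cheap.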


Part  II 

Multivariate  Random  Variables 


Chapter  2 

A  Short  Excursion  into  Matrix  Algebra 


This  chapter  serves  as  a  reminder  of  basic  concepts  of  matrix  algebra,  which 
are  particularly  useful  in  multivariate  analysis.  It  also  introduces  the  notations 
used  in  this  book  for  vectors  and  matrices.  Eigenvalues  and  eigenvectors  play  an 
important  role  in  multivariate  techniques.  In  Sects.  2.2  and  2.3,  we  present  the 
spectral  decomposition  of  matrices  and  consider  the  maximisation  (minimisation) 
of  quadratic  forms  given  some  constraints. 

In  analysing  the  multivariate  normal  distribution,  partitioned  matrices  appear 
naturally.  Some  of  the  basic  algebraic  properties  are  given  in  Sect.  2.5.  These 
properties  will  be  heavily  used  in  Chaps.  4  and  5. 

The geometry of the multinormal and the geometric interpretation of the
multivariate techniques (Part III) make intensive use of the notion of angles between two
vectors, the projection of a point on a vector and the distances between two points.
These ideas are introduced in Sect. 2.6.


2.1  Elementary  Operations 


A  matrix  A  is  a  system  of  numbers  with  n  rows  and  p  columns: 


$$A = \begin{pmatrix} a_{11} & a_{12} & \dots & a_{1p} \\ \vdots & a_{22} & & \vdots \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \dots & a_{np} \end{pmatrix}$$


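Table 2.1 Special matrices and vectors

| Name | Definition | Notation | Example |
|---|---|---|---|
| Scalar | $p = n = 1$ | $a$ | $3$ |
| Column vector | $p = 1$ | $a$ | $\begin{pmatrix} 1 \\ 3 \end{pmatrix}$ |
| Row vector | $n = 1$ | $a^{\top}$ | $\begin{pmatrix} 1 & 3 \end{pmatrix}$ |
| Vector of ones | $(1, \dots, 1)^{\top}$ ($n$ entries) | $1_n$ | $\begin{pmatrix} 1 \\ 1 \end{pmatrix}$ |
| Vector of zeros | $(0, \dots, 0)^{\top}$ ($n$ entries) | $0_n$ | $\begin{pmatrix} 0 \\ 0 \end{pmatrix}$ |
| Square matrix | $n = p$ | $A(p \times p)$ | $\begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$ |
| Diagonal matrix | $a_{ij} = 0$, $i \neq j$, $n = p$ | $\operatorname{diag}(a_{ii})$ | $\begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}$ |
| Identity matrix | $\operatorname{diag}(1, \dots, 1)$ ($p$ entries) | $I_p$ | $\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ |
| Unit matrix | $a_{ij} = 1$, $n = p$ | $1_n 1_n^{\top}$ | $\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$ |
| Symmetric matrix | $a_{ij} = a_{ji}$ | | $\begin{pmatrix} 1 & 2 \\ 2 & 3 \end{pmatrix}$ |
| Null matrix | $a_{ij} = 0$ | $0$ | $\begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}$ |
| Upper triangular matrix | $a_{ij} = 0$, $i > j$ | | $\begin{pmatrix} 1 & 2 & 4 \\ 0 & 1 & 3 \\ 0 & 0 & 1 \end{pmatrix}$ |
| Idempotent matrix | $AA = A$ | | $\begin{pmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{2} & \frac{1}{2} \\ 0 & \frac{1}{2} & \frac{1}{2} \end{pmatrix}$ |
| Orthogonal matrix | $A^{\top}A = I = AA^{\top}$ | | $\begin{pmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \end{pmatrix}$ |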

We also write $(a_{ij})$ for $A$ and $A(n \times p)$ to indicate the numbers of rows and
columns. Vectors are matrices with one column and are denoted as $x$ or $x(p \times 1)$.
Special matrices and vectors are defined in Table 2.1. Note that we use small letters
for scalars as well as for vectors.

Matrix Operations


Elementary operations are summarised below:

$$A^{\top} = (a_{ji})$$

$$A + B = (a_{ij} + b_{ij})$$

$$A - B = (a_{ij} - b_{ij})$$

$$c \cdot A = (c \cdot a_{ij})$$

$$A \cdot B = A(n \times p)\, B(p \times m) = C(n \times m) = (c_{ik}), \qquad c_{ik} = \sum_{j=1}^{p} a_{ij} b_{jk}$$


Properties  of  Matrix  Operations 


$$A + B = B + A$$

$$A(B + C) = AB + AC$$

$$A(BC) = (AB)C$$

$$(A^{\top})^{\top} = A$$

$$(AB)^{\top} = B^{\top} A^{\top}$$
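
These rules can be checked numerically on small examples. The sketch below uses plain nested lists and hypothetical integer matrices; it illustrates the identities on one instance and is of course not a proof.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def matmul(a, b):
    # c_ik = sum_j a_ij * b_jk
    return [[sum(a[i][j] * b[j][k] for j in range(len(b)))
             for k in range(len(b[0]))] for i in range(len(a))]

# Hypothetical example matrices.
A = [[1, 2], [3, 4]]
B = [[0, 1], [5, -2]]
C = [[2, 0], [1, 3]]

checks = {
    "A + B = B + A": add(A, B) == add(B, A),
    "A(B + C) = AB + AC": matmul(A, add(B, C)) == add(matmul(A, B), matmul(A, C)),
    "A(BC) = (AB)C": matmul(A, matmul(B, C)) == matmul(matmul(A, B), C),
    "(A^T)^T = A": transpose(transpose(A)) == A,
    "(AB)^T = B^T A^T": transpose(matmul(A, B)) == matmul(transpose(B), transpose(A)),
}
commutes = matmul(A, B) == matmul(B, A)  # matrix multiplication need not commute

print(checks, commutes)
```

Note that the list of properties contains no rule of the form AB = BA; for the matrices chosen here the two products differ.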


Matrix  Characteristics 

Rank 

The rank, $\operatorname{rank}(A)$, of a matrix $A(n \times p)$ is defined as the maximum number of
linearly independent rows (columns). A set of $k$ rows $a_j$ of $A(n \times p)$ is said to
be linearly independent if $\sum_{j=1}^{k} c_j a_j = 0_p$ implies $c_j = 0$, $\forall j$, where $c_1, \dots, c_k$
are scalars. In other words, no row in this set can be expressed as a linear
combination of the $(k - 1)$ remaining rows.


Trace 

The trace of a matrix $A(p \times p)$ is the sum of its diagonal elements

$$\operatorname{tr}(A) = \sum_{i=1}^{p} a_{ii}.$$


Determinant 

The determinant is an important concept of matrix algebra. For a square matrix $A$,
it is defined as:

$$\det(A) = |A| = \sum (-1)^{|\tau|} a_{1\tau(1)} \cdots a_{p\tau(p)},$$

where the summation is over all permutations $\tau$ of $\{1, 2, \dots, p\}$, and $|\tau| = 0$ if the
permutation can be written as a product of an even number of transpositions and
$|\tau| = 1$ otherwise. Some properties of the determinant of a $(p \times p)$ matrix are:

$$|A^{\top}| = |A|$$

$$|AB| = |A| \cdot |B|$$

$$|cA| = c^{p} |A|.$$


Example 2.1 In the case of $p = 2$, $A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$ and we can permute the digits
"1" and "2" once or not at all. So,

$$|A| = a_{11} a_{22} - a_{12} a_{21}.$$
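
The permutation definition translates directly into code. The following sketch is for illustration only: it sums over all p! permutations and is therefore practical only for small matrices. The sign of a permutation is obtained from its number of inversions, whose parity equals the parity of the number of transpositions.

```python
from itertools import permutations

def sign(perm):
    """+1 for an even permutation, -1 for an odd one (via inversion count)."""
    inversions = sum(1 for i in range(len(perm))
                     for j in range(i + 1, len(perm)) if perm[i] > perm[j])
    return -1 if inversions % 2 else 1

def det(a):
    """Determinant as a signed sum over all permutations of the column indices."""
    p = len(a)
    total = 0
    for tau in permutations(range(p)):
        term = sign(tau)
        for i in range(p):
            term *= a[i][tau[i]]
        total += term
    return total

print(det([[1, 2], [2, 3]]))                    # a11*a22 - a12*a21
print(det([[1, 2, 4], [0, 1, 3], [0, 0, 1]]))   # product of the diagonal
```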


Transpose 

For $A(n \times p)$ and $B(p \times n)$

$$(A^{\top})^{\top} = A, \quad \text{and} \quad (AB)^{\top} = B^{\top} A^{\top}.$$


Inverse 

If $|A| \neq 0$ and $A(p \times p)$, then the inverse $A^{-1}$ exists:

$$A A^{-1} = A^{-1} A = I_p.$$


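For small matrices, the inverse of $A = (a_{ij})$ can be calculated as

$$A^{-1} = \frac{C}{|A|},$$

where $C = (c_{ij})$ is the adjoint matrix of $A$. The elements $c_{ji}$ of $C^{\top}$ are the co-factors
of $A$:

$$c_{ji} = (-1)^{i+j} \begin{vmatrix} a_{11} & \dots & a_{1(j-1)} & a_{1(j+1)} & \dots & a_{1p} \\ \vdots & & \vdots & \vdots & & \vdots \\ a_{(i-1)1} & \dots & a_{(i-1)(j-1)} & a_{(i-1)(j+1)} & \dots & a_{(i-1)p} \\ a_{(i+1)1} & \dots & a_{(i+1)(j-1)} & a_{(i+1)(j+1)} & \dots & a_{(i+1)p} \\ \vdots & & \vdots & \vdots & & \vdots \\ a_{p1} & \dots & a_{p(j-1)} & a_{p(j+1)} & \dots & a_{pp} \end{vmatrix}.$$

The relationship between the determinant and the inverse of a matrix $A$ is $|A^{-1}| = |A|^{-1}$.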
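
A minimal sketch of the cofactor route to the inverse is given below. It uses exact rational arithmetic and a recursive Laplace expansion for the determinant, which is an implementation choice made here for brevity; the helper names `minor`, `det` and `inverse` are hypothetical.

```python
from fractions import Fraction

def minor(a, i, j):
    """Matrix a with row i and column j deleted."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(a) if k != i]

def det(a):
    """Determinant by Laplace expansion along the first row."""
    if not a:
        return 1  # determinant of the empty matrix
    return sum((-1) ** j * a[0][j] * det(minor(a, 0, j)) for j in range(len(a)))

def inverse(a):
    """Inverse as adjoint matrix divided by the determinant."""
    p = len(a)
    d = det(a)
    if d == 0:
        raise ValueError("matrix is singular, no inverse exists")
    # Entry (j, i) of the adjoint is the cofactor belonging to a_ij.
    return [[Fraction((-1) ** (i + j) * det(minor(a, i, j)), d)
             for i in range(p)] for j in range(p)]

A = [[1, 2], [2, 3]]
A_inv = inverse(A)
print(A_inv, det(A_inv), Fraction(1, det(A)))
```

For larger matrices this approach becomes expensive very quickly, which is why the text recommends it for small matrices only.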


G-Inverse 

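A more general concept is the G-inverse (Generalised Inverse) $A^{-}$ which satisfies
the following:

$$A A^{-} A = A.$$

Later we will see that there may be more than one G-inverse.

Example 2.2 The generalised inverse can also be calculated for singular matrices.
We have:

$$\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix},$$

which means that the generalised inverse of $A = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$ is $A^{-} = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$, even
though the inverse matrix of $A$ does not exist in this case.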

Eigenvalues,  Eigenvectors 

Consider a $(p \times p)$ matrix $A$. If there exist a scalar $\lambda$ and a nonzero vector $\gamma$ such that

$$A\gamma = \lambda\gamma, \tag{2.1}$$

then we call $\lambda$ an eigenvalue and $\gamma$ an eigenvector.

It can be proven that an eigenvalue $\lambda$ is a root of the $p$-th order polynomial
$|A - \lambda I_p| = 0$. Therefore, there are up to $p$ eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_p$ of $A$. For each
eigenvalue $\lambda_j$, a corresponding eigenvector $\gamma_j$ exists given by Eq. (2.1). Suppose
the matrix $A$ has the eigenvalues $\lambda_1, \dots, \lambda_p$. Let $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_p)$.

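The determinant $|A|$ and the trace $\operatorname{tr}(A)$ can be rewritten in terms of the
eigenvalues:

$$|A| = |\Lambda| = \prod_{j=1}^{p} \lambda_j \tag{2.2}$$

$$\operatorname{tr}(A) = \operatorname{tr}(\Lambda) = \sum_{j=1}^{p} \lambda_j \tag{2.3}$$

An idempotent matrix $A$ (see the definition in Table 2.1) can only have eigenvalues
in $\{0, 1\}$; therefore $\operatorname{tr}(A) = \operatorname{rank}(A)$ = number of eigenvalues $\neq 0$.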

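Example 2.3 Let us consider the matrix $A = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{2} & \frac{1}{2} \\ 0 & \frac{1}{2} & \frac{1}{2} \end{pmatrix}$. It is easy to verify that
$AA = A$, which implies that the matrix $A$ is idempotent.

We know that the eigenvalues of an idempotent matrix are equal to 0 or 1. In this
case, the eigenvalues of $A$ are $\lambda_1 = 1$, $\lambda_2 = 1$, and $\lambda_3 = 0$ since

$$A \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} = 1 \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \quad A \begin{pmatrix} 0 \\ \frac{\sqrt{2}}{2} \\ \frac{\sqrt{2}}{2} \end{pmatrix} = 1 \begin{pmatrix} 0 \\ \frac{\sqrt{2}}{2} \\ \frac{\sqrt{2}}{2} \end{pmatrix}, \quad \text{and} \quad A \begin{pmatrix} 0 \\ \frac{\sqrt{2}}{2} \\ -\frac{\sqrt{2}}{2} \end{pmatrix} = 0 \begin{pmatrix} 0 \\ \frac{\sqrt{2}}{2} \\ -\frac{\sqrt{2}}{2} \end{pmatrix}.$$

Using formulas (2.2) and (2.3), we can calculate the trace and the determinant
of $A$ from the eigenvalues: $\operatorname{tr}(A) = \lambda_1 + \lambda_2 + \lambda_3 = 2$, $|A| = \lambda_1 \lambda_2 \lambda_3 = 0$, and
$\operatorname{rank}(A) = 2$.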
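
The claims of this example can be verified with exact arithmetic. The sketch below uses unnormalised eigenvectors (scaling does not affect the eigenvector property) and computes the rank by Gaussian elimination; the helper names are hypothetical.

```python
from fractions import Fraction

half = Fraction(1, 2)
A = [[1, 0, 0], [0, half, half], [0, half, half]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(a, v):
    return [sum(row[k] * v[k] for k in range(len(v))) for row in a]

def rank(a):
    """Rank via Gaussian elimination (number of pivots), in exact arithmetic."""
    m = [[Fraction(v) for v in row] for row in a]
    r = 0
    for c in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            f = m[i][c] / m[r][c]
            m[i] = [x - f * y for x, y in zip(m[i], m[r])]
        r += 1
    return r

is_idempotent = matmul(A, A) == A
trace = sum(A[i][i] for i in range(3))
print(is_idempotent, trace, rank(A))
print(matvec(A, [1, 0, 0]), matvec(A, [0, 1, 1]), matvec(A, [0, 1, -1]))
```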


Properties  of  Matrix  Characteristics 

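$A(n \times n)$, $B(n \times n)$, $c \in \mathbb{R}$

$$\operatorname{tr}(A + B) = \operatorname{tr} A + \operatorname{tr} B \tag{2.4}$$

$$\operatorname{tr}(cA) = c \operatorname{tr} A \tag{2.5}$$

$$|cA| = c^{n} |A| \tag{2.6}$$

$$|AB| = |BA| = |A||B| \tag{2.7}$$

$A(n \times p)$, $B(p \times n)$

$$\operatorname{tr}(A \cdot B) = \operatorname{tr}(B \cdot A) \tag{2.8}$$

$$\operatorname{rank}(A) \le \min(n, p)$$

$$\operatorname{rank}(A) \ge 0 \tag{2.9}$$

$$\operatorname{rank}(A) = \operatorname{rank}(A^{\top}) \tag{2.10}$$

$$\operatorname{rank}(A^{\top}A) = \operatorname{rank}(A) \tag{2.11}$$

$$\operatorname{rank}(A + B) \le \operatorname{rank}(A) + \operatorname{rank}(B) \tag{2.12}$$

$$\operatorname{rank}(AB) \le \min\{\operatorname{rank}(A), \operatorname{rank}(B)\} \tag{2.13}$$

$A(n \times p)$, $B(p \times q)$, $C(q \times n)$

$$\operatorname{tr}(ABC) = \operatorname{tr}(BCA) = \operatorname{tr}(CAB) \tag{2.14}$$

$$\operatorname{rank}(ABC) = \operatorname{rank}(B) \quad \text{for nonsingular } A, C \tag{2.15}$$

$A(p \times p)$

$$|A^{-1}| = |A|^{-1} \tag{2.16}$$

$$\operatorname{rank}(A) = p \quad \text{if and only if } A \text{ is nonsingular.} \tag{2.17}$$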

Summary

The determinant $|A|$ is the product of the eigenvalues of $A$.

The inverse of a matrix $A$ exists if $|A| \neq 0$.

The trace $\operatorname{tr}(A)$ is the sum of the eigenvalues of $A$.

The sum of the traces of two matrices equals the trace of the sum
of the two matrices.

The trace $\operatorname{tr}(AB)$ equals $\operatorname{tr}(BA)$.

The $\operatorname{rank}(A)$ is the maximal number of linearly independent rows
(columns) of $A$.


2.2  Spectral  Decompositions 

The  computation  of  eigenvalues  and  eigenvectors  is  an  important  issue  in  the 
analysis  of  matrices.  The  spectral  decomposition  or  Jordan  decomposition  links  the 
structure  of  a  matrix  to  the  eigenvalues  and  the  eigenvectors. 

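Theorem 2.1 (Jordan Decomposition) Each symmetric matrix $A(p \times p)$ can be
written as

$$A = \Gamma \Lambda \Gamma^{\top} = \sum_{j=1}^{p} \lambda_j \gamma_j \gamma_j^{\top} \tag{2.18}$$

where

$$\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_p)$$

and where

$$\Gamma = (\gamma_1, \gamma_2, \dots, \gamma_p)$$

is an orthogonal matrix consisting of the eigenvectors $\gamma_j$ of $A$.

Example 2.4 Suppose that $A = \begin{pmatrix} 1 & 2 \\ 2 & 3 \end{pmatrix}$. The eigenvalues are found by solving
$|A - \lambda I| = 0$. This is equivalent to

$$\begin{vmatrix} 1 - \lambda & 2 \\ 2 & 3 - \lambda \end{vmatrix} = (1 - \lambda)(3 - \lambda) - 4 = 0.$$

Hence, the eigenvalues are $\lambda_1 = 2 + \sqrt{5}$ and $\lambda_2 = 2 - \sqrt{5}$. The eigenvectors are
$\gamma_1 = (0.5257, 0.8506)^{\top}$ and $\gamma_2 = (0.8506, -0.5257)^{\top}$. They are orthogonal since
$\gamma_1^{\top}\gamma_2 = 0$.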
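
The numbers in this example can be reproduced with a short sketch. It relies on the closed-form eigen-decomposition of a symmetric 2 x 2 matrix with nonzero off-diagonal element, which is an assumption made here to avoid any library dependency; the function name `eig_sym_2x2` is hypothetical.

```python
import math

def eig_sym_2x2(a, b, d):
    """Eigenpairs of the symmetric matrix [[a, b], [b, d]], assuming b != 0."""
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    pairs = []
    for lam in (mean + radius, mean - radius):
        # (A - lam*I) v = 0 is solved by v proportional to (b, lam - a).
        vx, vy = b, lam - a
        norm = math.hypot(vx, vy)
        pairs.append((lam, (vx / norm, vy / norm)))
    return pairs

pairs = eig_sym_2x2(1.0, 2.0, 3.0)
(lam1, g1), (lam2, g2) = pairs

# Orthogonality of the eigenvectors and reconstruction via (2.18).
inner = g1[0] * g2[0] + g1[1] * g2[1]
recon = [[sum(lam * g[i] * g[j] for lam, g in pairs) for j in range(2)]
         for i in range(2)]

print(lam1, lam2, g1, g2, inner, recon)
```

Summing the rank-one terms of (2.18) recovers the original matrix up to rounding error.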

Using spectral decomposition, we can define powers of a matrix $A(p \times p)$.
Suppose $A$ is a symmetric matrix with positive eigenvalues. Then by Theorem 2.1

$$A = \Gamma \Lambda \Gamma^{\top},$$

and we define for some $\alpha \in \mathbb{R}$

$$A^{\alpha} = \Gamma \Lambda^{\alpha} \Gamma^{\top}, \tag{2.19}$$

where $\Lambda^{\alpha} = \operatorname{diag}(\lambda_1^{\alpha}, \dots, \lambda_p^{\alpha})$. In particular, we can easily calculate the inverse of
the matrix $A$. Suppose that the eigenvalues of $A$ are positive. Then with $\alpha = -1$,
we obtain the inverse of $A$ from

$$A^{-1} = \Gamma \Lambda^{-1} \Gamma^{\top}. \tag{2.20}$$
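
As an illustration of (2.19) and (2.20), the sketch below computes a matrix square root and an inverse for a hypothetical positive definite 2 x 2 matrix. It again assumes the closed-form eigen-decomposition for symmetric 2 x 2 matrices with nonzero off-diagonal element; `sym_power_2x2` is a hypothetical helper, not a library function.

```python
import math

def sym_power_2x2(a, b, d, alpha):
    """A^alpha = Gamma Lambda^alpha Gamma^T for A = [[a, b], [b, d]], b != 0."""
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    result = [[0.0, 0.0], [0.0, 0.0]]
    for lam in (mean + radius, mean - radius):
        if lam <= 0:
            raise ValueError("all eigenvalues must be positive")
        vx, vy = b, lam - a              # eigenvector direction for lam
        norm = math.hypot(vx, vy)
        g = (vx / norm, vy / norm)
        for i in range(2):
            for j in range(2):
                result[i][j] += lam ** alpha * g[i] * g[j]
    return result

# Hypothetical example: A = [[2, 1], [1, 2]] has eigenvalues 3 and 1.
root = sym_power_2x2(2.0, 1.0, 2.0, 0.5)
inv = sym_power_2x2(2.0, 1.0, 2.0, -1.0)
root_squared = [[sum(root[i][k] * root[k][j] for k in range(2)) for j in range(2)]
                for i in range(2)]
print(root_squared, inv)
```

Multiplying the square root by itself returns the original matrix, and the power with exponent -1 agrees with the usual inverse.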

Another  interesting  decomposition  which  is  later  used  is  given  in  the  following 
theorem. 

Theorem 2.2 (Singular Value Decomposition) Each matrix $A(n \times p)$ with rank $r$
can be decomposed as

$$A = \Gamma \Lambda \Delta^{\top},$$

where $\Gamma(n \times r)$ and $\Delta(p \times r)$. Both $\Gamma$ and $\Delta$ are column orthonormal, i.e. $\Gamma^{\top}\Gamma =
\Delta^{\top}\Delta = I_r$ and $\Lambda = \operatorname{diag}\left(\lambda_1^{1/2}, \dots, \lambda_r^{1/2}\right)$, $\lambda_j > 0$. The values $\lambda_1, \dots, \lambda_r$ are
the nonzero eigenvalues of the matrices $AA^{\top}$ and $A^{\top}A$. $\Gamma$ and $\Delta$ consist of the
corresponding $r$ eigenvectors of these matrices.


This is obviously a generalisation of Theorem 2.1 (Jordan decomposition). With
Theorem 2.2, we can find a G-inverse $A^{-}$ of $A$. Indeed, define $A^{-} = \Delta \Lambda^{-1} \Gamma^{\top}$.
Then $A A^{-} A = \Gamma \Lambda \Delta^{\top} = A$. Note that the G-inverse is not unique.
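
The intermediate steps behind this identity can be written out as follows. This is a sketch that uses only the column orthonormality $\Gamma^{\top}\Gamma = \Delta^{\top}\Delta = I_r$ from Theorem 2.2 and the candidate G-inverse $\Delta \Lambda^{-1} \Gamma^{\top}$:

$$\begin{aligned}
A A^{-} A &= (\Gamma \Lambda \Delta^{\top})(\Delta \Lambda^{-1} \Gamma^{\top})(\Gamma \Lambda \Delta^{\top}) \\
&= \Gamma \Lambda (\Delta^{\top}\Delta) \Lambda^{-1} (\Gamma^{\top}\Gamma) \Lambda \Delta^{\top} \\
&= \Gamma \Lambda I_r \Lambda^{-1} I_r \Lambda \Delta^{\top} \\
&= \Gamma \Lambda \Delta^{\top} = A.
\end{aligned}$$

The diagonal matrix $\Lambda$ is invertible because all $\lambda_j$ are strictly positive.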

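Example 2.5 In Example 2.2, we showed that the generalised inverse of $A = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$
is $A^{-} = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$. The following also holds

$$\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 8 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix},$$

which means that the matrix $\begin{pmatrix} 1 & 0 \\ 0 & 8 \end{pmatrix}$ is also a generalised inverse of $A$.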
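
The non-uniqueness is easy to check numerically. In the sketch below the candidate matrices are illustrative choices: for this particular singular matrix, any candidate whose upper-left entry equals 1 satisfies the defining equation, whatever its other entries are.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def is_g_inverse(a, g):
    """Check the defining property A G A = A of a generalised inverse."""
    return matmul(matmul(a, g), a) == a

A = [[1, 0], [0, 0]]  # singular matrix: the ordinary inverse does not exist

# Illustrative candidates; all entries other than the upper-left 1 are arbitrary.
candidates = [
    [[1, 0], [0, 0]],
    [[1, 0], [0, 8]],
    [[1, 5], [-3, 8]],
]
results = [is_g_inverse(A, g) for g in candidates]
not_g_inverse = is_g_inverse(A, [[2, 0], [0, 0]])  # upper-left entry is not 1
print(results, not_g_inverse)
```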


Summary

The Jordan decomposition gives a representation of a symmetric
matrix in terms of eigenvalues and eigenvectors.

The eigenvectors belonging to the largest eigenvalues indicate the
"main direction" of the data.

The Jordan decomposition allows one to easily compute the power
of a symmetric matrix $A$: $A^{\alpha} = \Gamma \Lambda^{\alpha} \Gamma^{\top}$.

The singular value decomposition (SVD) is a generalisation of the
Jordan decomposition to non-square matrices.


2.3  Quadratic  Forms 

A quadratic form $Q(x)$ is built from a symmetric matrix $A(p \times p)$ and a vector
$x \in \mathbb{R}^p$:

$$Q(x) = x^{\top} A x = \sum_{i=1}^{p} \sum_{j=1}^{p} a_{ij} x_i x_j. \tag{2.21}$$


Definiteness of Quadratic Forms and Matrices

$Q(x) > 0$ for all $x \neq 0$: positive definite

$Q(x) \ge 0$ for all $x \neq 0$: positive semidefinite

A matrix $A$ is called positive definite (semidefinite) if the corresponding quadratic
form $Q(\cdot)$ is positive definite (semidefinite). We write $A > 0$ ($\ge 0$).

Quadratic  forms  can  always  be  diagonalised,  as  the  following  result  shows. 

Theorem 2.3 If $A$ is symmetric and $Q(x) = x^{\top} A x$ is the corresponding quadratic
form, then there exists a transformation $x \mapsto \Gamma^{\top} x = y$ such that

$$x^{\top} A x = \sum_{i=1}^{p} \lambda_i y_i^2,$$

where $\lambda_i$ are the eigenvalues of $A$.

Proof By Theorem 2.1, $A = \Gamma \Lambda \Gamma^{\top}$. With $y = \Gamma^{\top} x$ we have that $x^{\top} A x =
x^{\top} \Gamma \Lambda \Gamma^{\top} x = y^{\top} \Lambda y = \sum_{i=1}^{p} \lambda_i y_i^2$. □

Positive definiteness of quadratic forms can be deduced from positive eigenvalues.

Theorem 2.4 $A > 0$ if and only if all $\lambda_i > 0$, $i = 1, \dots, p$.

Proof $0 < \lambda_1 y_1^2 + \dots + \lambda_p y_p^2 = x^{\top} A x$ for all $x \neq 0$ by Theorem 2.3. □


Corollary 2.1 If $\mathcal{A} > 0$, then $\mathcal{A}^{-1}$ exists and $|\mathcal{A}| > 0$.

Example 2.6 The quadratic form $Q(x) = x_1^2 + x_2^2$ corresponds to the matrix $\mathcal{A} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ with eigenvalues $\lambda_1 = \lambda_2 = 1$ and is thus positive definite. The quadratic form $Q(x) = (x_1 - x_2)^2$ corresponds to the matrix $\mathcal{A} = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}$ with eigenvalues $\lambda_1 = 2$, $\lambda_2 = 0$ and is positive semidefinite. The quadratic form $Q(x) = x_1^2 - x_2^2$ with eigenvalues $\lambda_1 = 1$, $\lambda_2 = -1$ is indefinite.
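The eigenvalue criterion of Theorem 2.4 can be checked numerically for the three quadratic forms of Example 2.6. The following is a minimal sketch for symmetric (2 x 2) matrices only; the helper names `sym2_eigenvalues` and `classify` and the tolerance `tol` are illustrative assumptions, not part of the book's quantlets.

```python
import math

def sym2_eigenvalues(a11, a12, a22):
    """Eigenvalues (largest first) of the symmetric matrix [[a11, a12], [a12, a22]]."""
    centre = (a11 + a22) / 2.0
    radius = math.hypot((a11 - a22) / 2.0, a12)
    return centre + radius, centre - radius

def classify(a11, a12, a22, tol=1e-12):
    """Classify Q(x) = a11*x1^2 + 2*a12*x1*x2 + a22*x2^2 by the signs of the eigenvalues."""
    largest, smallest = sym2_eigenvalues(a11, a12, a22)
    if smallest > tol:
        return "positive definite"
    if smallest >= -tol:
        return "positive semidefinite"
    if largest < -tol:
        return "negative definite"
    if largest <= tol:
        return "negative semidefinite"
    return "indefinite"

print(classify(1, 0, 1))    # Q(x) = x1^2 + x2^2
print(classify(1, -1, 1))   # Q(x) = (x1 - x2)^2
print(classify(1, 0, -1))   # Q(x) = x1^2 - x2^2
```

The closed-form eigenvalues are used here only because the matrices are (2 x 2); for larger matrices a numerical eigenvalue routine would be needed.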

In  the  statistical  analysis  of  multivariate  data,  we  are  interested  in  maximising 
quadratic  forms  given  some  constraints. 

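Theorem 2.5 If $\mathcal{A}$ and $\mathcal{B}$ are symmetric and $\mathcal{B} > 0$, then the maximum of $\frac{x^{\top} \mathcal{A} x}{x^{\top} \mathcal{B} x}$ is given by the largest eigenvalue of $\mathcal{B}^{-1} \mathcal{A}$. More generally,

$$\max_{x} \frac{x^{\top} \mathcal{A} x}{x^{\top} \mathcal{B} x} = \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p = \min_{x} \frac{x^{\top} \mathcal{A} x}{x^{\top} \mathcal{B} x},$$

where $\lambda_1, \ldots, \lambda_p$ denote the eigenvalues of $\mathcal{B}^{-1} \mathcal{A}$. The vector which maximises (minimises) $\frac{x^{\top} \mathcal{A} x}{x^{\top} \mathcal{B} x}$ is the eigenvector of $\mathcal{B}^{-1} \mathcal{A}$ which corresponds to the largest (smallest) eigenvalue of $\mathcal{B}^{-1} \mathcal{A}$. If $x^{\top} \mathcal{B} x = 1$, we get

$$\max_{x} x^{\top} \mathcal{A} x = \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p = \min_{x} x^{\top} \mathcal{A} x.$$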

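Proof Denote the norm of a vector $x$ by $\|x\| = \sqrt{x^{\top} x}$. By definition, $\mathcal{B}^{1/2} = \Gamma_{\mathcal{B}} \Lambda_{\mathcal{B}}^{1/2} \Gamma_{\mathcal{B}}^{\top}$ is symmetric. Then $x^{\top} \mathcal{B} x = \|x^{\top} \mathcal{B}^{1/2}\|^2 = \|\mathcal{B}^{1/2} x\|^2$. Set $y = \frac{\mathcal{B}^{1/2} x}{\|\mathcal{B}^{1/2} x\|}$, then

$$\max_{x} \frac{x^{\top} \mathcal{A} x}{x^{\top} \mathcal{B} x} = \max_{\{y:\, y^{\top} y = 1\}} y^{\top} \mathcal{B}^{-1/2} \mathcal{A} \mathcal{B}^{-1/2} y. \tag{2.22}$$

From Theorem 2.1, let

$$\mathcal{B}^{-1/2} \mathcal{A} \mathcal{B}^{-1/2} = \Gamma \Lambda \Gamma^{\top}$$

be the spectral decomposition of $\mathcal{B}^{-1/2} \mathcal{A} \mathcal{B}^{-1/2}$. Set

$$z = \Gamma^{\top} y, \quad \text{then} \quad z^{\top} z = y^{\top} \Gamma \Gamma^{\top} y = y^{\top} y.$$

Thus (2.22) is equivalent to

$$\max_{\{z:\, z^{\top} z = 1\}} z^{\top} \Lambda z = \max_{\{z:\, z^{\top} z = 1\}} \sum_{i=1}^{p} \lambda_i z_i^2.$$

But

$$\max_{z} \sum_{i} \lambda_i z_i^2 \leq \lambda_1 \underbrace{\max_{z} \sum_{i} z_i^2}_{=1} = \lambda_1.$$

The maximum is thus obtained by $z = (1, 0, \ldots, 0)^{\top}$, i.e.

$$y = \gamma_1, \quad \text{hence} \quad x = \mathcal{B}^{-1/2} \gamma_1.$$

Since $\mathcal{B}^{-1} \mathcal{A}$ and $\mathcal{B}^{-1/2} \mathcal{A} \mathcal{B}^{-1/2}$ have the same eigenvalues, the proof is complete.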

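To maximise (minimise) $x^{\top} \mathcal{A} x$ under $x^{\top} \mathcal{B} x = 1$, below is another proof using the Lagrange method.

$$\max_{x} x^{\top} \mathcal{A} x = \max_{x} \left[ x^{\top} \mathcal{A} x - \lambda \left( x^{\top} \mathcal{B} x - 1 \right) \right].$$

The first derivative with respect to $x$ is set equal to 0:

$$2 \mathcal{A} x - 2 \lambda \mathcal{B} x = 0,$$

so

$$\mathcal{B}^{-1} \mathcal{A} x = \lambda x.$$

By the definition of eigenvector and eigenvalue, our maximiser $x^{*}$ is an eigenvector of $\mathcal{B}^{-1} \mathcal{A}$ corresponding to the eigenvalue $\lambda$. So

$$\max_{\{x:\, x^{\top} \mathcal{B} x = 1\}} x^{\top} \mathcal{A} x = \max_{\{x:\, x^{\top} \mathcal{B} x = 1\}} x^{\top} \mathcal{B} \mathcal{B}^{-1} \mathcal{A} x = \max_{\{x:\, x^{\top} \mathcal{B} x = 1\}} x^{\top} \mathcal{B} \lambda x = \max \lambda,$$

which is just the maximum eigenvalue of $\mathcal{B}^{-1} \mathcal{A}$, and we choose the corresponding eigenvector as our maximiser $x^{*}$. □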


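Example 2.7 Consider the matrices

$$\mathcal{A} = \begin{pmatrix} 1 & 2 \\ 2 & 3 \end{pmatrix} \quad \text{and} \quad \mathcal{B} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},$$

we calculate

$$\mathcal{B}^{-1} \mathcal{A} = \begin{pmatrix} 1 & 2 \\ 2 & 3 \end{pmatrix}.$$

The biggest eigenvalue of the matrix $\mathcal{B}^{-1} \mathcal{A}$ is $2 + \sqrt{5}$. This means that the maximum of $x^{\top} \mathcal{A} x$ under the constraint $x^{\top} \mathcal{B} x = 1$ is $2 + \sqrt{5}$. Notice that the constraint $x^{\top} \mathcal{B} x = 1$ corresponds, with our choice of $\mathcal{B}$, to the points which lie on the unit circle $x_1^2 + x_2^2 = 1$.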
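Theorem 2.5 can be illustrated by brute force for this example. The sketch below assumes $\mathcal{A} = \begin{pmatrix} 1 & 2 \\ 2 & 3 \end{pmatrix}$ (the matrix whose Hessian is computed in Example 2.8) and $\mathcal{B} = \mathcal{I}_2$, so that the constraint is the unit circle; the function names and the grid size `steps` are illustrative choices.

```python
import math

# Assumed matrices: A as in Example 2.8, B = identity (unit-circle constraint).
A = [[1.0, 2.0], [2.0, 3.0]]

def quadratic_form(matrix, x):
    """x^T M x for a (2 x 2) matrix M."""
    return sum(x[i] * matrix[i][j] * x[j] for i in range(2) for j in range(2))

def largest_eigenvalue_sym2(matrix):
    """Closed-form largest eigenvalue of a symmetric (2 x 2) matrix."""
    centre = (matrix[0][0] + matrix[1][1]) / 2.0
    radius = math.hypot((matrix[0][0] - matrix[1][1]) / 2.0, matrix[0][1])
    return centre + radius

def max_on_unit_circle(matrix, steps=100_000):
    """Largest value of x^T M x over a fine grid of points with x1^2 + x2^2 = 1."""
    best = -math.inf
    for k in range(steps):
        t = 2.0 * math.pi * k / steps
        best = max(best, quadratic_form(matrix, (math.cos(t), math.sin(t))))
    return best

lambda_max = largest_eigenvalue_sym2(A)
brute_force = max_on_unit_circle(A)
print(lambda_max, 2 + math.sqrt(5), brute_force)
```

The grid search only approximates the maximum from below; it agrees with the largest eigenvalue up to the grid resolution.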


Summary

↪ A quadratic form can be described by a symmetric matrix $\mathcal{A}$.

↪ Quadratic forms can always be diagonalised.

↪ Positive definiteness of a quadratic form is equivalent to positiveness of the eigenvalues of the matrix $\mathcal{A}$.

↪ The maximum and minimum of a quadratic form given some constraints can be expressed in terms of eigenvalues.


2.4  Derivatives 

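For later sections of this book, it will be useful to introduce matrix notation for derivatives of a scalar function of a vector $x$, i.e. $f(x)$, with respect to $x$. Consider $f: \mathbb{R}^p \to \mathbb{R}$ and a $(p \times 1)$ vector $x$, then $\frac{\partial f(x)}{\partial x}$ is the column vector of partial derivatives $\left\{ \frac{\partial f(x)}{\partial x_j} \right\}$, $j = 1, \ldots, p$ and $\frac{\partial f(x)}{\partial x^{\top}}$ is the row vector of the same derivative ($\frac{\partial f(x)}{\partial x}$ is called the gradient of $f$).

We can also introduce second order derivatives: $\frac{\partial^2 f(x)}{\partial x \partial x^{\top}}$ is the $(p \times p)$ matrix of elements $\frac{\partial^2 f(x)}{\partial x_i \partial x_j}$, $i = 1, \ldots, p$ and $j = 1, \ldots, p$ ($\frac{\partial^2 f(x)}{\partial x \partial x^{\top}}$ is called the Hessian of $f$).

Suppose that $a$ is a $(p \times 1)$ vector and that $\mathcal{A} = \mathcal{A}^{\top}$ is a $(p \times p)$ matrix. Then

$$\frac{\partial a^{\top} x}{\partial x} = \frac{\partial x^{\top} a}{\partial x} = a, \tag{2.23}$$

$$\frac{\partial x^{\top} \mathcal{A} x}{\partial x} = 2 \mathcal{A} x. \tag{2.24}$$

The Hessian of the quadratic form $Q(x) = x^{\top} \mathcal{A} x$ is

$$\frac{\partial^2 x^{\top} \mathcal{A} x}{\partial x \partial x^{\top}} = 2 \mathcal{A}. \tag{2.25}$$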


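Example 2.8 Consider the matrix

$$\mathcal{A} = \begin{pmatrix} 1 & 2 \\ 2 & 3 \end{pmatrix}.$$

From formulas (2.24) and (2.25) it immediately follows that the gradient of $Q(x) = x^{\top} \mathcal{A} x$ is

$$\frac{\partial x^{\top} \mathcal{A} x}{\partial x} = 2 \mathcal{A} x = 2 \begin{pmatrix} 1 & 2 \\ 2 & 3 \end{pmatrix} x = \begin{pmatrix} 2 x_1 + 4 x_2 \\ 4 x_1 + 6 x_2 \end{pmatrix}$$

and the Hessian is

$$\frac{\partial^2 x^{\top} \mathcal{A} x}{\partial x \partial x^{\top}} = 2 \mathcal{A} = 2 \begin{pmatrix} 1 & 2 \\ 2 & 3 \end{pmatrix} = \begin{pmatrix} 2 & 4 \\ 4 & 6 \end{pmatrix}.$$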
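Formula (2.24) can be checked against a finite-difference approximation. The following is a minimal sketch using the matrix of Example 2.8; the evaluation point `x0`, the step size and the function names are hypothetical choices made only for illustration.

```python
# A as in Example 2.8.
A = [[1.0, 2.0], [2.0, 3.0]]

def quadratic_form(x):
    """Q(x) = x^T A x."""
    return sum(x[i] * A[i][j] * x[j] for i in range(2) for j in range(2))

def analytic_gradient(x):
    """Formula (2.24): dQ/dx = 2 A x."""
    return [2.0 * sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]

def numeric_gradient(x, step=1e-6):
    """Central finite differences, one coordinate at a time."""
    gradient = []
    for i in range(2):
        forward, backward = list(x), list(x)
        forward[i] += step
        backward[i] -= step
        gradient.append((quadratic_form(forward) - quadratic_form(backward)) / (2.0 * step))
    return gradient

x0 = [1.0, -2.0]               # hypothetical evaluation point
print(analytic_gradient(x0))   # 2 A x0
print(numeric_gradient(x0))    # finite-difference approximation
```

Because $Q$ is quadratic, central differences reproduce the gradient up to rounding error, which makes the comparison a convenient sanity check.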


2.5  Partitioned  Matrices 

Very  often  we  will  have  to  consider  certain  groups  of  rows  and  columns  of  a  matrix 
A(n  x  p).  In  the  case  of  two  groups,  we  have 

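$$\mathcal{A} = \begin{pmatrix} \mathcal{A}_{11} & \mathcal{A}_{12} \\ \mathcal{A}_{21} & \mathcal{A}_{22} \end{pmatrix},$$

where $\mathcal{A}_{ij}(n_i \times p_j)$, $i, j = 1, 2$, $n_1 + n_2 = n$ and $p_1 + p_2 = p$.

If $\mathcal{B}(n \times p)$ is partitioned accordingly, we have:

$$\mathcal{A} + \mathcal{B} = \begin{pmatrix} \mathcal{A}_{11} + \mathcal{B}_{11} & \mathcal{A}_{12} + \mathcal{B}_{12} \\ \mathcal{A}_{21} + \mathcal{B}_{21} & \mathcal{A}_{22} + \mathcal{B}_{22} \end{pmatrix}$$

$$\mathcal{B}^{\top} = \begin{pmatrix} \mathcal{B}_{11}^{\top} & \mathcal{B}_{21}^{\top} \\ \mathcal{B}_{12}^{\top} & \mathcal{B}_{22}^{\top} \end{pmatrix}$$

$$\mathcal{A} \mathcal{B}^{\top} = \begin{pmatrix} \mathcal{A}_{11} \mathcal{B}_{11}^{\top} + \mathcal{A}_{12} \mathcal{B}_{12}^{\top} & \mathcal{A}_{11} \mathcal{B}_{21}^{\top} + \mathcal{A}_{12} \mathcal{B}_{22}^{\top} \\ \mathcal{A}_{21} \mathcal{B}_{11}^{\top} + \mathcal{A}_{22} \mathcal{B}_{12}^{\top} & \mathcal{A}_{21} \mathcal{B}_{21}^{\top} + \mathcal{A}_{22} \mathcal{B}_{22}^{\top} \end{pmatrix}.$$

An important particular case is the square matrix $\mathcal{A}(p \times p)$, partitioned in such a way that $\mathcal{A}_{11}$ and $\mathcal{A}_{22}$ are both square matrices (i.e. $n_j = p_j$, $j = 1, 2$). It can be verified that when $\mathcal{A}$ is non-singular ($\mathcal{A} \mathcal{A}^{-1} = \mathcal{I}_p$):

$$\mathcal{A}^{-1} = \begin{pmatrix} \mathcal{A}^{11} & \mathcal{A}^{12} \\ \mathcal{A}^{21} & \mathcal{A}^{22} \end{pmatrix} \tag{2.26}$$

where

$$\mathcal{A}^{11} = (\mathcal{A}_{11} - \mathcal{A}_{12} \mathcal{A}_{22}^{-1} \mathcal{A}_{21})^{-1} = (\mathcal{A}_{11 \cdot 2})^{-1}$$

$$\mathcal{A}^{12} = -(\mathcal{A}_{11 \cdot 2})^{-1} \mathcal{A}_{12} \mathcal{A}_{22}^{-1}$$

$$\mathcal{A}^{21} = -\mathcal{A}_{22}^{-1} \mathcal{A}_{21} (\mathcal{A}_{11 \cdot 2})^{-1}$$

$$\mathcal{A}^{22} = \mathcal{A}_{22}^{-1} + \mathcal{A}_{22}^{-1} \mathcal{A}_{21} (\mathcal{A}_{11 \cdot 2})^{-1} \mathcal{A}_{12} \mathcal{A}_{22}^{-1}.$$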


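An alternative expression can be obtained by reversing the positions of $\mathcal{A}_{11}$ and $\mathcal{A}_{22}$ in the original matrix.

The following results will be useful if $\mathcal{A}_{11}$ is non-singular:

$$|\mathcal{A}| = |\mathcal{A}_{11}| \, |\mathcal{A}_{22} - \mathcal{A}_{21} \mathcal{A}_{11}^{-1} \mathcal{A}_{12}| = |\mathcal{A}_{11}| \, |\mathcal{A}_{22 \cdot 1}|. \tag{2.27}$$

If $\mathcal{A}_{22}$ is non-singular, we have that:

$$|\mathcal{A}| = |\mathcal{A}_{22}| \, |\mathcal{A}_{11} - \mathcal{A}_{12} \mathcal{A}_{22}^{-1} \mathcal{A}_{21}| = |\mathcal{A}_{22}| \, |\mathcal{A}_{11 \cdot 2}|. \tag{2.28}$$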

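A useful formula is derived from the alternative expressions for the inverse and the determinant. For instance let

$$\mathcal{B} = \begin{pmatrix} 1 & b^{\top} \\ a & \mathcal{A} \end{pmatrix}$$

where $a$ and $b$ are $(p \times 1)$ vectors and $\mathcal{A}$ is non-singular. We then have:

$$|\mathcal{B}| = |\mathcal{A} - a b^{\top}| = |\mathcal{A}| \, |1 - b^{\top} \mathcal{A}^{-1} a| \tag{2.29}$$

and equating the two expressions for $\mathcal{B}^{22}$, we obtain the following:

$$(\mathcal{A} - a b^{\top})^{-1} = \mathcal{A}^{-1} + \frac{\mathcal{A}^{-1} a b^{\top} \mathcal{A}^{-1}}{1 - b^{\top} \mathcal{A}^{-1} a}. \tag{2.30}$$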
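Formulas (2.29) and (2.30) can be verified numerically on a small case. The sketch below uses a hypothetical non-singular (2 x 2) matrix `A` and hypothetical vectors `a` and `b`; none of these values come from the text, and the helper names are illustrative.

```python
def det2(m):
    """Determinant of a (2 x 2) matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def inv2(m):
    """Inverse of a non-singular (2 x 2) matrix."""
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]

# Hypothetical inputs, chosen only for illustration.
A = [[4.0, 1.0], [2.0, 3.0]]
a = [1.0, 2.0]
b = [0.5, -1.0]

A_inv = inv2(A)
A_inv_a = [sum(A_inv[i][k] * a[k] for k in range(2)) for i in range(2)]    # A^{-1} a
bT_A_inv = [sum(b[k] * A_inv[k][j] for k in range(2)) for j in range(2)]   # b^T A^{-1}
denominator = 1.0 - sum(b[k] * A_inv_a[k] for k in range(2))               # 1 - b^T A^{-1} a

A_minus_abT = [[A[i][j] - a[i] * b[j] for j in range(2)] for i in range(2)]

# Formula (2.29): determinant of A - a b^T.
det_direct = det2(A_minus_abT)
det_formula = det2(A) * denominator

# Formula (2.30): inverse of A - a b^T.
inverse_direct = inv2(A_minus_abT)
inverse_formula = [[A_inv[i][j] + A_inv_a[i] * bT_A_inv[j] / denominator
                    for j in range(2)] for i in range(2)]

print(det_direct, det_formula)
print(inverse_direct)
print(inverse_formula)
```

The formula requires $1 - b^{\top} \mathcal{A}^{-1} a \neq 0$; with the values above this quantity is 1.55.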


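Example 2.9 Let's consider the matrix

$$\mathcal{A} = \begin{pmatrix} 1 & 2 \\ 2 & 2 \end{pmatrix}.$$

We can use formula (2.26) to calculate the inverse of a partitioned matrix, i.e. $\mathcal{A}^{11} = -1$, $\mathcal{A}^{12} = \mathcal{A}^{21} = 1$, $\mathcal{A}^{22} = -1/2$. The inverse of $\mathcal{A}$ is

$$\mathcal{A}^{-1} = \begin{pmatrix} -1 & 1 \\ 1 & -0.5 \end{pmatrix}.$$

It is also easy to calculate the determinant of $\mathcal{A}$:

$$|\mathcal{A}| = |1| \, |2 - 4| = -2.$$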
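The block formulas of (2.26) and the determinant formula (2.28) can be traced step by step for Example 2.9, where every block is a scalar. This is a minimal sketch with exact rational arithmetic; the variable names are illustrative.

```python
from fractions import Fraction

# Blocks of A = [[1, 2], [2, 2]] from Example 2.9; here every block is a scalar.
A11, A12, A21, A22 = Fraction(1), Fraction(2), Fraction(2), Fraction(2)

A11_2 = A11 - A12 * (1 / A22) * A21            # A_{11.2} = A11 - A12 A22^{-1} A21
inv11 = 1 / A11_2                              # A^{11}
inv12 = -inv11 * A12 * (1 / A22)               # A^{12}
inv21 = -(1 / A22) * A21 * inv11               # A^{21}
inv22 = 1 / A22 + (1 / A22) * A21 * inv11 * A12 * (1 / A22)   # A^{22}

determinant = A22 * A11_2                      # formula (2.28)

print(inv11, inv12, inv21, inv22)
print(determinant)
```

For genuine matrix blocks the same steps apply, with the scalar reciprocals replaced by matrix inverses and the order of the factors preserved.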

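Let $\mathcal{A}(n \times p)$ and $\mathcal{B}(p \times n)$ be any two matrices and suppose that $n \geq p$. From (2.27) and (2.28) we can conclude that

$$\begin{vmatrix} -\lambda \mathcal{I}_n & -\mathcal{A} \\ \mathcal{B} & \mathcal{I}_p \end{vmatrix} = (-\lambda)^{n-p} \, |\mathcal{B} \mathcal{A} - \lambda \mathcal{I}_p| = |\mathcal{A} \mathcal{B} - \lambda \mathcal{I}_n|. \tag{2.31}$$

Since both determinants on the right-hand side of (2.31) are polynomials in $\lambda$, we find that the $n$ eigenvalues of $\mathcal{A} \mathcal{B}$ yield the $p$ eigenvalues of $\mathcal{B} \mathcal{A}$ plus the eigenvalue 0, $n - p$ times.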

The  relationship  between  the  eigenvectors  is  described  in  the  next  theorem. 

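Theorem 2.6 For $\mathcal{A}(n \times p)$ and $\mathcal{B}(p \times n)$, the nonzero eigenvalues of $\mathcal{A} \mathcal{B}$ and $\mathcal{B} \mathcal{A}$ are the same and have the same multiplicity. If $x$ is an eigenvector of $\mathcal{A} \mathcal{B}$ for an eigenvalue $\lambda \neq 0$, then $y = \mathcal{B} x$ is an eigenvector of $\mathcal{B} \mathcal{A}$.

Corollary 2.2 For $\mathcal{A}(n \times p)$, $\mathcal{B}(q \times n)$, $a(p \times 1)$, and $b(q \times 1)$ we have

$$\operatorname{rank}(\mathcal{A} a b^{\top} \mathcal{B}) \leq 1.$$

The nonzero eigenvalue, if it exists, equals $b^{\top} \mathcal{B} \mathcal{A} a$ (with eigenvector $\mathcal{A} a$).

Proof Theorem 2.6 asserts that the eigenvalues of $\mathcal{A} a b^{\top} \mathcal{B}$ are the same as those of $b^{\top} \mathcal{B} \mathcal{A} a$. Note that the matrix $b^{\top} \mathcal{B} \mathcal{A} a$ is a scalar and hence it is its own eigenvalue $\lambda_1$.

Applying $\mathcal{A} a b^{\top} \mathcal{B}$ to $\mathcal{A} a$ yields

$$(\mathcal{A} a b^{\top} \mathcal{B})(\mathcal{A} a) = (\mathcal{A} a)(b^{\top} \mathcal{B} \mathcal{A} a) = \lambda_1 \mathcal{A} a.$$

□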
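Corollary 2.2 can be illustrated with small integer matrices. The sketch below assumes hypothetical inputs with $n = 3$ and $p = q = 2$; the values and the helper `matmul` are chosen only for illustration.

```python
def matmul(left, right):
    """Product of two matrices stored as lists of rows."""
    return [[sum(left[i][k] * right[k][j] for k in range(len(right)))
             for j in range(len(right[0]))] for i in range(len(left))]

# Hypothetical inputs: A is (3 x 2), B is (2 x 3), a and b are (2 x 1).
A = [[1, 2], [0, 1], [1, 0]]
B = [[2, 0, 1], [1, 1, 0]]
a = [[1], [1]]
b = [[1], [-1]]
bT = [[row[0] for row in b]]

Aa = matmul(A, a)                     # candidate eigenvector, (3 x 1)
bTB = matmul(bT, B)                   # row vector b^T B, (1 x 3)
M = matmul(Aa, bTB)                   # the (3 x 3) matrix A a b^T B
eigenvalue = matmul(bTB, Aa)[0][0]    # the scalar b^T B A a

lhs = matmul(M, Aa)                            # (A a b^T B)(A a)
rhs = [[eigenvalue * row[0]] for row in Aa]    # lambda_1 * (A a)

# A matrix has rank at most one exactly when all of its (2 x 2) minors vanish.
minors_vanish = all(M[i][j] * M[k][l] - M[i][l] * M[k][j] == 0
                    for i in range(3) for k in range(3)
                    for j in range(3) for l in range(3))

print(eigenvalue, lhs, rhs, minors_vanish)
```

Integer entries keep the comparison exact, so no tolerance is needed.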


2.6  Geometrical  Aspects 
Distance 

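Let $x, y \in \mathbb{R}^p$. A distance $d$ is defined as a function $d: \mathbb{R}^{2p} \to \mathbb{R}_{+}$ which fulfils

$$\begin{cases} d(x, y) > 0 & \forall x \neq y \\ d(x, y) = 0 & \text{if and only if } x = y \\ d(x, y) \leq d(x, z) + d(z, y) & \forall x, y, z. \end{cases}$$

A Euclidean distance $d$ between two points $x$ and $y$ is defined as

$$d^2(x, y) = (x - y)^{\top} \mathcal{A} (x - y) \tag{2.32}$$

where $\mathcal{A}$ is a positive definite matrix ($\mathcal{A} > 0$). $\mathcal{A}$ is called a metric.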


Fig. 2.1 Distance $d$

Fig. 2.2 Iso-distance sphere

Fig. 2.3 Iso-distance ellipsoid

Example 2.10 A particular case is when $\mathcal{A} = \mathcal{I}_p$, i.e.

$$d^2(x, y) = \sum_{i=1}^{p} (x_i - y_i)^2. \tag{2.33}$$

Figure 2.1 illustrates this definition for $p = 2$.

Note that the sets $E_d = \{x \in \mathbb{R}^p \mid (x - x_0)^{\top} (x - x_0) = d^2\}$, i.e. the spheres with radius $d$ and centre $x_0$, are the Euclidean $\mathcal{I}_p$ iso-distance curves from the point $x_0$ (see Fig. 2.2).

The more general distance (2.32) with a positive definite matrix $\mathcal{A}$ ($\mathcal{A} > 0$) leads to the iso-distance curves

$$E_d = \{x \in \mathbb{R}^p \mid (x - x_0)^{\top} \mathcal{A} (x - x_0) = d^2\}, \tag{2.34}$$

i.e. ellipsoids with centre $x_0$, matrix $\mathcal{A}$ and constant $d$ (see Fig. 2.3).
Let $\gamma_1, \gamma_2, \ldots, \gamma_p$ be the orthonormal eigenvectors of $\mathcal{A}$ corresponding to the eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p$. The resulting observations are given in the next theorem.

Theorem 2.7 (i) The principal axes of $E_d$ are in the direction of $\gamma_i$; $i = 1, \ldots, p$.

(ii) The half-lengths of the axes are $\sqrt{d^2 / \lambda_i}$; $i = 1, \ldots, p$.

(iii) The rectangle surrounding the ellipsoid $E_d$ is defined by the following inequalities:

$$x_{0i} - \sqrt{d^2 a^{ii}} \leq x_i \leq x_{0i} + \sqrt{d^2 a^{ii}}, \quad i = 1, \ldots, p,$$

where $a^{ii}$ is the $(i, i)$ element of $\mathcal{A}^{-1}$. By the rectangle surrounding the ellipsoid $E_d$ we mean the rectangle whose sides are parallel to the coordinate axes.
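Statements (ii) and (iii) of Theorem 2.7 can be checked numerically in two dimensions. The sketch below assumes a hypothetical metric `A`, the constant `d = 1` and an ellipse centred at the origin; points on the ellipse are generated by rescaling unit directions, which is an illustrative choice rather than the book's method.

```python
import math

# Hypothetical metric and constant; the ellipse x^T A x = d^2 is centred at the origin.
A = [[2.0, 1.0], [1.0, 2.0]]
d = 1.0

centre = (A[0][0] + A[1][1]) / 2.0
radius = math.hypot((A[0][0] - A[1][1]) / 2.0, A[0][1])
eigenvalues = (centre + radius, centre - radius)              # lambda_1 >= lambda_2
half_lengths = [math.sqrt(d * d / lam) for lam in eigenvalues]

det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
inverse_diagonal = (A[1][1] / det, A[0][0] / det)             # a^{11}, a^{22}
rectangle = [math.sqrt(d * d * v) for v in inverse_diagonal]  # half-widths of the rectangle

def point_on_ellipse(t):
    """Rescale the unit direction (cos t, sin t) so that it satisfies x^T A x = d^2."""
    u = (math.cos(t), math.sin(t))
    q = sum(u[i] * A[i][j] * u[j] for i in range(2) for j in range(2))
    scale = d / math.sqrt(q)
    return (scale * u[0], scale * u[1])

steps = 100_000
points = [point_on_ellipse(2.0 * math.pi * k / steps) for k in range(steps)]
largest_x1 = max(p[0] for p in points)               # compare with sqrt(d^2 a^{11})
longest_radius = max(math.hypot(*p) for p in points) # compare with the longest half-length

print(half_lengths, rectangle, largest_x1, longest_radius)
```

The longest half-length belongs to the smallest eigenvalue, in line with the half-lengths being inversely proportional to the square roots of the eigenvalues.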

It  is  easy  to  find  the  coordinates  of  the  tangency  points  between  the  ellipsoid  and 
its  surrounding  rectangle  parallel  to  the  coordinate  axes.  Let  us  find  the  coordinates 
of  the  tangency  point  that  are  in  the  direction  of  the  j  -th  coordinate  axis  (positive 
direction). 

For ease of notation, we suppose the ellipsoid is centred around the origin ($x_0 = 0$). If not, the rectangle will be shifted by the value of $x_0$.

The coordinate of the tangency point is given by the solution to the following problem:

$$x = \arg \max_{x^{\top} \mathcal{A} x = d^2} e_j^{\top} x \tag{2.35}$$

where $e_j$ is the $j$-th column of the identity matrix $\mathcal{I}_p$. The coordinate of the tangency point in the negative direction would correspond to the solution of the min problem: by symmetry, it is the opposite value of the former.

The solution is computed via the Lagrangian $L = e_j^{\top} x - \lambda (x^{\top} \mathcal{A} x - d^2)$ which by (2.23) leads to the following system of equations:

$$\frac{\partial L}{\partial x} = e_j - 2 \lambda \mathcal{A} x = 0 \tag{2.36}$$

$$\frac{\partial L}{\partial \lambda} = x^{\top} \mathcal{A} x - d^2 = 0. \tag{2.37}$$

This gives $x = \frac{1}{2\lambda} \mathcal{A}^{-1} e_j$, or componentwise

$$x_i = \frac{1}{2\lambda} a^{ij}, \quad i = 1, \ldots, p \tag{2.38}$$

where $a^{ij}$ denotes the $(i, j)$-th element of $\mathcal{A}^{-1}$.

Premultiplying (2.36) by $x^{\top}$, we have from (2.37):

$$x_j = 2 \lambda d^2.$$

Comparing this to the value obtained by (2.38), for $i = j$ we obtain $2\lambda = \sqrt{a^{jj} / d^2}$. We choose the positive value of the square root because we are maximising $e_j^{\top} x$. A minimum would correspond to the negative value. Finally, we have the coordinates of the tangency point between the ellipsoid and its surrounding rectangle in the positive direction of the $j$-th axis:

$$x_i = \sqrt{\frac{d^2}{a^{jj}}} \, a^{ij}, \quad i = 1, \ldots, p. \tag{2.39}$$

The particular case where $i = j$ provides statement (iii) in Theorem 2.7.
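The tangency formula (2.39) can be sanity-checked by confirming that the resulting point lies on the ellipse and touches the rectangle of Theorem 2.7 (iii). This is a minimal sketch with a hypothetical metric `A` and constant `d`, for an ellipse centred at the origin; indices are zero-based in the code.

```python
import math

# Hypothetical metric and constant; the ellipse x^T A x = d^2 is centred at the origin.
A = [[2.0, 1.0], [1.0, 2.0]]
d = 1.0

det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
A_inv = [[A[1][1] / det, -A[0][1] / det], [-A[1][0] / det, A[0][0] / det]]

def tangency_point(j):
    """Formula (2.39): x_i = sqrt(d^2 / a^{jj}) * a^{ij}, positive direction of axis j."""
    scale = math.sqrt(d * d / A_inv[j][j])
    return [scale * A_inv[i][j] for i in range(2)]

def quadratic_form(x):
    """x^T A x."""
    return sum(x[i] * A[i][k] * x[k] for i in range(2) for k in range(2))

for j in range(2):
    x = tangency_point(j)
    # The point should satisfy x^T A x = d^2 and x_j = sqrt(d^2 a^{jj}).
    print(j, x, quadratic_form(x), math.sqrt(d * d * A_inv[j][j]))
```

The $j$-th coordinate of the tangency point equals the half-width of the rectangle in that direction, which is exactly statement (iii).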


Remark: Usefulness of Theorem 2.7

Theorem 2.7 will prove to be particularly useful in many subsequent chapters. First, it provides a helpful tool for graphing an ellipse in two dimensions. Indeed, knowing the slope of the principal axes of the ellipse, their half-lengths and the rectangle circumscribing the ellipse allows one to quickly draw a rough picture of the shape of the ellipse.

In Chap. 7, it is shown that the confidence region for the vector $\mu$ of a multivariate normal population is given by a particular ellipsoid whose parameters depend on sample characteristics. The rectangle circumscribing the ellipsoid (which is much easier to obtain) will provide the simultaneous confidence intervals for all of the components in $\mu$.

In addition it will be shown that the contour surfaces of the multivariate normal density are provided by ellipsoids whose parameters depend on the mean vector and on the covariance matrix. We will see that the tangency points between the contour ellipsoids and the surrounding rectangle are determined by regressing one component on the $(p - 1)$ other components. For instance, in the direction of the $j$-th axis, the tangency points are given by the intersections of the ellipsoid contours with the regression line of the vector of $(p - 1)$ variables (all components except the $j$-th) on the $j$-th component.

Norm of a Vector

Consider a vector $x \in \mathbb{R}^p$. The norm or length of $x$ (with respect to the metric $\mathcal{I}_p$) is defined as

$$\|x\| = d(0, x) = \sqrt{x^{\top} x}.$$

If $\|x\| = 1$, $x$ is called a unit vector. A more general norm can be defined with respect to the metric $\mathcal{A}$:

$$\|x\|_{\mathcal{A}} = \sqrt{x^{\top} \mathcal{A} x}.$$


Angle  Between  Two  Vectors 

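Consider two vectors $x$ and $y \in \mathbb{R}^p$. The angle $\theta$ between $x$ and $y$ is defined by the cosine of $\theta$:

$$\cos \theta = \frac{x^{\top} y}{\|x\| \, \|y\|}, \tag{2.40}$$

see Fig. 2.4. Indeed for $p = 2$, $x = (x_1, x_2)^{\top}$ and $y = (y_1, y_2)^{\top}$, we have

$$\begin{aligned} \|x\| \cos \theta_1 &= x_1; & \|y\| \cos \theta_2 &= y_1 \\ \|x\| \sin \theta_1 &= x_2; & \|y\| \sin \theta_2 &= y_2, \end{aligned} \tag{2.41}$$

therefore,

$$\cos \theta = \cos \theta_1 \cos \theta_2 + \sin \theta_1 \sin \theta_2 = \frac{x_1 y_1 + x_2 y_2}{\|x\| \, \|y\|} = \frac{x^{\top} y}{\|x\| \, \|y\|}.$$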

Fig. 2.4 Angle between vectors

Fig. 2.5 Projection

Remark 2.1 If $x^{\top} y = 0$, then the angle $\theta$ is equal to $\frac{\pi}{2}$. From trigonometry, we know that the cosine of $\theta$ equals the length of the base of a triangle ($\|p_x\|$) divided by the length of the hypotenuse ($\|x\|$). Hence, we have

$$\|p_x\| = \|x\| \, |\cos \theta| = \frac{|x^{\top} y|}{\|y\|}, \tag{2.42}$$

where $p_x$ is the projection of $x$ on $y$ (which is defined below). It is the coordinate of $x$ on the $y$ vector, see Fig. 2.5.

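The angle can also be defined with respect to a general metric $\mathcal{A}$:

$$\cos \theta = \frac{x^{\top} \mathcal{A} y}{\|x\|_{\mathcal{A}} \, \|y\|_{\mathcal{A}}}. \tag{2.43}$$

If $\cos \theta = 0$ then $x$ is orthogonal to $y$ with respect to the metric $\mathcal{A}$.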

Example 2.11 Assume that there are two centred (i.e. zero mean) data vectors. The cosine of the angle between them is equal to their correlation (defined in (3.8)). Indeed for $x$ and $y$ with $\bar{x} = \bar{y} = 0$ we have

$$r_{XY} = \frac{\sum x_i y_i}{\sqrt{\sum x_i^2 \sum y_i^2}} = \cos \theta$$

according to formula (2.40).
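The identity of Example 2.11 is easy to confirm numerically. The sketch below uses two short hypothetical data vectors; `centre`, `cosine` and `correlation` are illustrative helper names, and the correlation is computed from the empirical covariances with factor $1/n$.

```python
import math

def centre(values):
    """Subtract the mean so that the data vector has zero mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def cosine(x, y):
    """Formula (2.40): cos(theta) = x^T y / (||x|| ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def correlation(x, y):
    """Empirical correlation s_XY / sqrt(s_XX s_YY) computed from the raw data."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    s_xx = sum((a - mean_x) ** 2 for a in x) / n
    s_yy = sum((b - mean_y) ** 2 for b in y) / n
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data vectors, chosen only for illustration.
x_raw = [1.0, 2.0, 3.0, 4.0]
y_raw = [2.0, 1.0, 4.0, 3.0]

cos_theta = cosine(centre(x_raw), centre(y_raw))
r_xy = correlation(x_raw, y_raw)
print(cos_theta, r_xy)
```

The centring step matters: without it the cosine of the angle between the raw vectors generally differs from the correlation.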


Rotations 

When we consider a point $x \in \mathbb{R}^p$, we generally use a $p$-coordinate system to obtain its geometric representation, like in Fig. 2.1 for instance. There will be situations in multivariate techniques where we will want to rotate this system of coordinates by the angle $\theta$.

Consider for example the point $P$ with coordinates $x = (x_1, x_2)^{\top}$ in $\mathbb{R}^2$ with respect to a given set of orthogonal axes. Let $\Gamma$ be a $(2 \times 2)$ orthogonal matrix where

$$\Gamma = \begin{pmatrix} \cos \theta & \sin \theta \\ -\sin \theta & \cos \theta \end{pmatrix}. \tag{2.44}$$




If the axes are rotated about the origin through an angle $\theta$ in a clockwise direction, the new coordinates of $P$ will be given by the vector $y$

$$y = \Gamma x, \tag{2.45}$$

and a rotation through the same angle in an anti-clockwise direction gives the new coordinates as

$$y = \Gamma^{\top} x. \tag{2.46}$$

More generally, premultiplying a vector $x$ by an orthogonal matrix $\Gamma$ geometrically corresponds to a rotation of the system of axes, so that the first new axis is determined by the first row of $\Gamma$. This geometric point of view will be exploited in Chaps. 11 and 12.


Column  Space  and  Null  Space  of  a  Matrix 

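Define for $\mathcal{X}(n \times p)$

$$\operatorname{Im}(\mathcal{X}) = C(\mathcal{X}) = \{x \in \mathbb{R}^n \mid \exists a \in \mathbb{R}^p \text{ so that } \mathcal{X} a = x\},$$

the space generated by the columns of $\mathcal{X}$ or the column space of $\mathcal{X}$. Note that $C(\mathcal{X}) \subseteq \mathbb{R}^n$ and $\dim\{C(\mathcal{X})\} = \operatorname{rank}(\mathcal{X}) = r \leq \min(n, p)$.

$$\operatorname{Ker}(\mathcal{X}) = N(\mathcal{X}) = \{y \in \mathbb{R}^p \mid \mathcal{X} y = 0\}$$

is the null space of $\mathcal{X}$. Note that $N(\mathcal{X}) \subseteq \mathbb{R}^p$ and that $\dim\{N(\mathcal{X})\} = p - r$.

Remark 2.2 $N(\mathcal{X}^{\top})$ is the orthogonal complement of $C(\mathcal{X})$ in $\mathbb{R}^n$, i.e. given a vector $b \in \mathbb{R}^n$ it will hold that $x^{\top} b = 0$ for all $x \in C(\mathcal{X})$, if and only if $b \in N(\mathcal{X}^{\top})$.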


Example 2.12 Let

$$\mathcal{X} = \begin{pmatrix} 2 & 3 & 5 \\ 4 & 6 & 7 \\ 6 & 8 & 6 \\ 8 & 2 & 4 \end{pmatrix}.$$

It is easy to show (e.g. by calculating the determinant of a $(3 \times 3)$ submatrix of $\mathcal{X}$) that $\operatorname{rank}(\mathcal{X}) = 3$. Hence, the column space $C(\mathcal{X})$ is a three-dimensional subspace of $\mathbb{R}^4$. The null space of $\mathcal{X}$ contains only the zero vector $(0, 0, 0)^{\top}$ and its dimension is equal to $3 - \operatorname{rank}(\mathcal{X}) = 0$.

For

$$\mathcal{X} = \begin{pmatrix} 2 & 3 & 1 \\ 4 & 6 & 2 \\ 6 & 8 & 3 \\ 8 & 2 & 4 \end{pmatrix},$$

the third column is a multiple of the first one and the matrix $\mathcal{X}$ cannot be of full rank. Noticing that the first two columns of $\mathcal{X}$ are independent, we see that $\operatorname{rank}(\mathcal{X}) = 2$. In this case, the dimension of the column space is 2 and the dimension of the null space is 1.
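The rank statement for the second matrix of Example 2.12 can be reproduced with exact Gaussian elimination. The following is a minimal sketch; the function `rank` and the candidate null-space vector are illustrative and not taken from the book.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a matrix via Gaussian elimination with exact rational arithmetic."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    r = 0                                    # number of pivots found so far
    for c in range(n_cols):
        pivot = next((i for i in range(r, n_rows) if m[i][c] != 0), None)
        if pivot is None:
            continue                         # no pivot in this column
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, n_rows):
            factor = m[i][c] / m[r][c]
            m[i] = [vi - factor * vr for vi, vr in zip(m[i], m[r])]
        r += 1
        if r == n_rows:
            break
    return r

# Second matrix of Example 2.12: the third column is half of the first one.
X = [[2, 3, 1], [4, 6, 2], [6, 8, 3], [8, 2, 4]]

r = rank(X)
nullity = len(X[0]) - r                      # dim N(X) = p - r

# A vector spanning the null space: first column minus twice the third column.
null_vector = [1, 0, -2]
image = [sum(row[j] * null_vector[j] for j in range(3)) for row in X]

print(r, nullity, image)
```

Exact fractions avoid the tolerance questions that arise when rank is determined in floating-point arithmetic.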


Projection  Matrix 

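A matrix $\mathcal{P}(n \times n)$ is called an (orthogonal) projection matrix in $\mathbb{R}^n$ if and only if $\mathcal{P} = \mathcal{P}^{\top} = \mathcal{P}^2$ ($\mathcal{P}$ is idempotent). Let $b \in \mathbb{R}^n$. Then $a = \mathcal{P} b$ is the projection of $b$ on $C(\mathcal{P})$.


Projection on $C(\mathcal{X})$

Consider $\mathcal{X}(n \times p)$ and let

$$\mathcal{P} = \mathcal{X} (\mathcal{X}^{\top} \mathcal{X})^{-1} \mathcal{X}^{\top} \tag{2.47}$$

and $\mathcal{Q} = \mathcal{I}_n - \mathcal{P}$. It's easy to check that $\mathcal{P}$ and $\mathcal{Q}$ are idempotent and that

$$\mathcal{P} \mathcal{X} = \mathcal{X} \quad \text{and} \quad \mathcal{Q} \mathcal{X} = 0. \tag{2.48}$$

Since the columns of $\mathcal{X}$ are projected onto themselves, the projection matrix $\mathcal{P}$ projects any vector $b \in \mathbb{R}^n$ onto $C(\mathcal{X})$. Similarly, the projection matrix $\mathcal{Q}$ projects any vector $b \in \mathbb{R}^n$ onto the orthogonal complement of $C(\mathcal{X})$.

Theorem 2.8 Let $\mathcal{P}$ be the projection (2.47) and $\mathcal{Q}$ its orthogonal complement. Then:

(i) $x = \mathcal{P} b$ entails $x \in C(\mathcal{X})$,

(ii) $y = \mathcal{Q} b$ means that $y^{\top} x = 0$ $\forall x \in C(\mathcal{X})$.

Proof (i) holds, since $x = \mathcal{X} (\mathcal{X}^{\top} \mathcal{X})^{-1} \mathcal{X}^{\top} b = \mathcal{X} a$, where $a = (\mathcal{X}^{\top} \mathcal{X})^{-1} \mathcal{X}^{\top} b \in \mathbb{R}^p$.

(ii) follows from $y = b - \mathcal{P} b$ and $x = \mathcal{X} a$. Hence $y^{\top} x = b^{\top} \mathcal{X} a - b^{\top} \mathcal{X} (\mathcal{X}^{\top} \mathcal{X})^{-1} \mathcal{X}^{\top} \mathcal{X} a = 0$.

□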
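The properties in (2.48) and Theorem 2.8 can be verified exactly on a small case. The sketch below assumes a hypothetical $(3 \times 2)$ matrix `X` of full column rank and a hypothetical vector `b`; the helper functions are illustrative and handle only the $(2 \times 2)$ inverse needed here.

```python
from fractions import Fraction

def transpose(m):
    return [list(column) for column in zip(*m)]

def matmul(left, right):
    """Product of two matrices stored as lists of rows."""
    return [[sum(left[i][k] * right[k][j] for k in range(len(right)))
             for j in range(len(right[0]))] for i in range(len(left))]

def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

# Hypothetical (3 x 2) matrix of full column rank.
X = [[Fraction(1), Fraction(0)], [Fraction(1), Fraction(1)], [Fraction(1), Fraction(2)]]
Xt = transpose(X)

P = matmul(matmul(X, inverse_2x2(matmul(Xt, X))), Xt)        # formula (2.47)
Q = [[(1 if i == j else 0) - P[i][j] for j in range(3)] for i in range(3)]

b = [[Fraction(1)], [Fraction(2)], [Fraction(4)]]            # hypothetical vector
y = matmul(Q, b)                                             # part of b orthogonal to C(X)

print(P)
print(matmul(Xt, y))   # X^T y: inner products of y with the columns of X
```

Since every $x \in C(\mathcal{X})$ is a combination of the columns of $\mathcal{X}$, checking $\mathcal{X}^{\top} y = 0$ is enough to confirm statement (ii) for this `b`.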
Remark 2.3 Let $x, y \in \mathbb{R}^n$ and consider $p_x \in \mathbb{R}^n$, the projection of $x$ on $y$ (see Fig. 2.5). With $\mathcal{X} = y$ we have from (2.47)

$$p_x = y (y^{\top} y)^{-1} y^{\top} x = \frac{y^{\top} x}{\|y\|^2} \, y \tag{2.49}$$

and we can easily verify that

$$\|p_x\| = \sqrt{p_x^{\top} p_x} = \frac{|y^{\top} x|}{\|y\|}.$$

See again Remark 2.1.
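Formula (2.49) and the length of the projection can be checked with a two-dimensional example. This is a minimal sketch; the vectors `x` and `y` are hypothetical and the function name `project` is illustrative.

```python
import math

def project(x, y):
    """Projection of x on y, formula (2.49): p_x = (y^T x / ||y||^2) y."""
    coefficient = sum(a * b for a, b in zip(y, x)) / sum(b * b for b in y)
    return [coefficient * b for b in y]

# Hypothetical vectors, chosen only for illustration.
x = [3.0, 4.0]
y = [1.0, 1.0]

p_x = project(x, y)
residual = [a - b for a, b in zip(x, p_x)]

norm_p = math.sqrt(sum(v * v for v in p_x))
norm_formula = abs(sum(a * b for a, b in zip(x, y))) / math.sqrt(sum(b * b for b in y))
orthogonality = sum(r * b for r, b in zip(residual, y))   # should vanish

print(p_x, norm_p, norm_formula, orthogonality)
```

The residual $x - p_x$ is orthogonal to $y$, which is what makes $p_x$ an orthogonal projection.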


Summary 


↪ A distance between two $p$-dimensional points $x$ and $y$ is a quadratic form $(x - y)^{\top} \mathcal{A} (x - y)$ in the vectors of differences $(x - y)$. A distance defines the norm of a vector.

↪ Iso-distance curves of a point $x_0$ are all those points that have the same distance from $x_0$. Iso-distance curves are ellipsoids whose principal axes are determined by the direction of the eigenvectors of $\mathcal{A}$. The half-length of principal axes is proportional to the inverse of the roots of the eigenvalues of $\mathcal{A}$.

↪ The angle between two vectors $x$ and $y$ is given by $\cos \theta = \frac{x^{\top} \mathcal{A} y}{\|x\|_{\mathcal{A}} \, \|y\|_{\mathcal{A}}}$ w.r.t. the metric $\mathcal{A}$.

↪ For the Euclidean distance with $\mathcal{A} = \mathcal{I}$ the correlation between two centred data vectors $x$ and $y$ is given by the cosine of the angle between them, i.e. $\cos \theta = r_{XY}$.

↪ The projection $\mathcal{P} = \mathcal{X} (\mathcal{X}^{\top} \mathcal{X})^{-1} \mathcal{X}^{\top}$ is the projection onto the column space $C(\mathcal{X})$ of $\mathcal{X}$.

↪ The projection of $x \in \mathbb{R}^n$ on $y \in \mathbb{R}^n$ is given by $p_x = \frac{y^{\top} x}{\|y\|^2} \, y$.


2.7  Exercises 

Exercise 2.1 Compute the determinant for a $(3 \times 3)$ matrix.

Exercise 2.2 Suppose that $|\mathcal{A}| = 0$. Is it possible that all eigenvalues of $\mathcal{A}$ are positive?

Exercise 2.3 Suppose that all eigenvalues of some (square) matrix $\mathcal{A}$ are different from zero. Does the inverse $\mathcal{A}^{-1}$ of $\mathcal{A}$ exist?

Exercise  2.4  Write  a  program  that  calculates  the  Jordan  decomposition  of  the 
matrix 


Check Theorem 2.1 numerically.


Exercise  2.5  Prove  (2.23),  (2.24)  and  (2.25). 

Exercise  2.6  Show  that  a  projection  matrix  only  has  eigenvalues  in  {0, 1}. 

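Exercise 2.7 Draw some iso-distance ellipsoids for the metric $\mathcal{A} = \Sigma^{-1}$ of Example 3.13.

Exercise 2.8 Find a formula for $|\mathcal{A} + a a^{\top}|$ and for $(\mathcal{A} + a a^{\top})^{-1}$. (Hint: use the inverse partitioned matrix with $\mathcal{B} = \begin{pmatrix} 1 & -a^{\top} \\ a & \mathcal{A} \end{pmatrix}$.)

Exercise 2.9 Prove the Binomial inverse theorem for two non-singular matrices $\mathcal{A}(p \times p)$ and $\mathcal{B}(p \times p)$: $(\mathcal{A} + \mathcal{B})^{-1} = \mathcal{A}^{-1} - \mathcal{A}^{-1} (\mathcal{A}^{-1} + \mathcal{B}^{-1})^{-1} \mathcal{A}^{-1}$. (Hint: use (2.26) with $\mathcal{C} = \begin{pmatrix} \mathcal{A} & \mathcal{I}_p \\ -\mathcal{I}_p & \mathcal{B}^{-1} \end{pmatrix}$.)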


Chapter  3 

Moving  to  Higher  Dimensions 


We  have  seen  in  the  previous  chapters  how  very  simple  graphical  devices  can  help  in 
understanding  the  structure  and  dependency  of  data.  The  graphical  tools  were  based 
on  either  univariate  (bivariate)  data  representations  or  on  “slick”  transformations 
of  multivariate  information  perceivable  by  the  human  eye.  Most  of  the  tools  are 
extremely  useful  in  a  modelling  step,  but  unfortunately,  do  not  give  the  full  picture 
of  the  data  set.  One  reason  for  this  is  that  the  graphical  tools  presented  capture 
only  certain  dimensions  of  the  data  and  do  not  necessarily  concentrate  on  those 
dimensions  or  sub-parts  of  the  data  under  analysis  that  carry  the  maximum  structural 
information.  In  Part  III  of  this  book,  powerful  tools  for  reducing  the  dimension  of 
a  data  set  will  be  presented.  In  this  chapter,  as  a  starting  point,  simple  and  basic 
tools  are  used  to  describe  dependency.  They  are  constructed  from  elementary  facts 
of  probability  theory  and  introductory  statistics  (e.g.  the  covariance  and  correlation 
between  two  variables). 

Sections  3.1  and  3.2  show  how  to  handle  these  concepts  in  a  multivariate  setup 
and  how  a  simple  test  on  correlation  between  two  variables  can  be  derived.  Since 
linear  relationships  are  involved  in  these  measures,  Sect.  3.4  presents  the  simple 
linear model for two variables and recalls the basic t-test for the slope. In Sect. 3.5,
a  simple  example  of  one-factorial  analysis  of  variance  introduces  the  notations  for 
the  well-known  F-test. 

Due  to  the  power  of  matrix  notation,  all  of  this  can  easily  be  extended  to  a  more 
general  multivariate  setup.  Section  3.3  shows  how  matrix  operations  can  be  used  to 
define  summary  statistics  of  a  data  set  and  for  obtaining  the  empirical  moments  of 
linear  transformations  of  the  data.  These  results  will  prove  to  be  very  useful  in  most 
of  the  chapters  in  Part  III. 

Finally,  matrix  notation  allows  us  to  introduce  the  flexible  multiple  linear  model, 
where  more  general  relationships  among  variables  can  be  analysed.  In  Sect.  3.6,  the 
least  squares  adjustment  of  the  model  and  the  usual  test  statistics  are  presented  with 
their  geometric  interpretation.  Using  these  notations,  the  ANOVA  model  is  just  a 
particular  case  of  the  multiple  linear  model. 


3.1  Covariance 

Covariance  is  a  measure  of  dependency  between  random  variables.  Given  two 
(random)  variables  X  and  Y  the  (theoretical)  covariance  is  defined  by: 

$$\sigma_{XY} = \operatorname{Cov}(X, Y) = \operatorname{E}(XY) - (\operatorname{E} X)(\operatorname{E} Y). \tag{3.1}$$

The  precise  definition  of  expected  values  is  given  in  Chap.  4.  If  X  and  Y  are 
independent  of  each  other,  the  covariance  Cov(X,  Y)  is  necessarily  equal  to  zero, 
see  Theorem  3.1.  The  converse  is  not  true.  The  covariance  of  X  with  itself  is  the 
variance: 


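$$\sigma_{XX} = \operatorname{Var}(X) = \operatorname{Cov}(X, X).$$

If the variable $X$ is $p$-dimensional multivariate, e.g. $X = (X_1, \ldots, X_p)^{\top}$, then the theoretical covariances among all the elements are put into matrix form, i.e. the covariance matrix:

$$\Sigma = \begin{pmatrix} \sigma_{X_1 X_1} & \cdots & \sigma_{X_1 X_p} \\ \vdots & \ddots & \vdots \\ \sigma_{X_p X_1} & \cdots & \sigma_{X_p X_p} \end{pmatrix}.$$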


Properties  of  covariance  matrices  will  be  detailed  in  Chap.  4.  Empirical  versions  of 
these  quantities  are: 


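$$s_{XY} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \tag{3.2}$$

$$s_{XX} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2. \tag{3.3}$$

For small $n$, say $n \leq 20$, we should replace the factor $\frac{1}{n}$ in (3.2) and (3.3) by $\frac{1}{n-1}$ in order to correct for a small bias. For a $p$-dimensional random variable, one obtains the empirical covariance matrix (see Sect. 3.3 for properties and details)

$$\mathcal{S} = \begin{pmatrix} s_{X_1 X_1} & \cdots & s_{X_1 X_p} \\ \vdots & \ddots & \vdots \\ s_{X_p X_1} & \cdots & s_{X_p X_p} \end{pmatrix}.$$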
For a scatterplot of two variables the covariances measure “how close the scatter is to a line”. Mathematical details follow but it should already be understood here that in this sense covariance measures only “linear dependence”.

Example 3.1 If $\mathcal{X}$ is the entire bank data set, one obtains the covariance matrix $\mathcal{S}$ as indicated below:

$$\mathcal{S} = \begin{pmatrix}
0.14 & 0.03 & 0.02 & -0.10 & -0.01 & 0.08 \\
0.03 & 0.12 & 0.10 & 0.21 & 0.10 & -0.21 \\
0.02 & 0.10 & 0.16 & 0.28 & 0.12 & -0.24 \\
-0.10 & 0.21 & 0.28 & 2.07 & 0.16 & -1.03 \\
-0.01 & 0.10 & 0.12 & 0.16 & 0.64 & -0.54 \\
0.08 & -0.21 & -0.24 & -1.03 & -0.54 & 1.32
\end{pmatrix}. \tag{3.4}$$

The empirical covariance between $X_4$ and $X_5$, i.e. $s_{X_4 X_5}$, is found in row 4 and column 5. The value is $s_{X_4 X_5} = 0.16$. Is it obvious that this value is positive? In Exercise 3.1 we will discuss this question further.

If $\mathcal{X}_f$ denotes the counterfeit bank notes, we obtain:

$$\mathcal{S}_f = \begin{pmatrix}
0.123 & 0.031 & 0.023 & -0.099 & 0.019 & 0.011 \\
0.031 & 0.064 & 0.046 & -0.024 & -0.012 & -0.005 \\
0.024 & 0.046 & 0.088 & -0.018 & 0.000 & 0.034 \\
-0.099 & -0.024 & -0.018 & 1.268 & -0.485 & 0.236 \\
0.019 & -0.012 & 0.000 & -0.485 & 0.400 & -0.022 \\
0.011 & -0.005 & 0.034 & 0.236 & -0.022 & 0.308
\end{pmatrix}. \tag{3.5}$$

For the genuine $\mathcal{X}_g$, we have:

$$\mathcal{S}_g = \begin{pmatrix}
0.149 & 0.057 & 0.057 & 0.056 & 0.014 & 0.005 \\
0.057 & 0.131 & 0.085 & 0.056 & 0.048 & -0.043 \\
0.057 & 0.085 & 0.125 & 0.058 & 0.030 & -0.024 \\
0.056 & 0.056 & 0.058 & 0.409 & -0.261 & -0.000 \\
0.014 & 0.049 & 0.030 & -0.261 & 0.417 & -0.074 \\
0.005 & -0.043 & -0.024 & -0.000 & -0.074 & 0.198
\end{pmatrix}. \tag{3.6}$$

Note that the covariance between $X_4$ (distance of the frame to the lower border) and $X_5$ (distance of the frame to the upper border) is negative in both (3.5) and (3.6). Why would this happen? In Exercise 3.2 we will discuss this question in more detail.

At first sight, the matrices $\mathcal{S}_f$ and $\mathcal{S}_g$ look different, but they create almost the same scatterplots (see the discussion in Sect. 1.4). Similarly, the common principal component analysis in Chap. 11 suggests a joint analysis of the covariance structure as in Flury and Riedwyl (1988).
Fig. 3.1 Scatterplot of variables $X_4$ vs. $X_5$ of the entire bank data set (plot title: Swiss bank notes) Q MVAscabank45

Scatterplots with point clouds that are “upward-sloping”, like the one in the upper left of Fig. 1.14, show variables with positive covariance. Scatterplots with “downward-sloping” structure have negative covariance. In Fig. 3.1 we show the scatterplot of $X_4$ vs. $X_5$ of the entire bank data set. The point cloud is upward-sloping. However, the two sub-clouds of counterfeit and genuine bank notes are downward-sloping.
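How a pooled cloud can slope upward while each sub-cloud slopes downward can be seen on a toy example. The data below are synthetic and chosen only to reproduce the sign pattern; they are not the bank note measurements, and the function name `covariance` is illustrative.

```python
def covariance(x, y):
    """Empirical covariance with factor 1/n, as in (3.2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n

# Synthetic toy data (NOT the bank data): two groups, each downward-sloping,
# with the second group shifted up and to the right of the first.
group_a = ([8.0, 9.0, 10.0], [10.0, 9.0, 8.0])
group_b = ([10.0, 11.0, 12.0], [12.0, 11.0, 10.0])

pooled_x = group_a[0] + group_b[0]
pooled_y = group_a[1] + group_b[1]

cov_a = covariance(*group_a)
cov_b = covariance(*group_b)
cov_pooled = covariance(pooled_x, pooled_y)

print(cov_a, cov_b, cov_pooled)
```

In this toy case the difference between the group means dominates the pooled covariance and outweighs the negative covariance within each group.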

Example 3.2 A textile shop manager is studying the sales of “classic blue” pullovers over ten different periods. He observes the number of pullovers sold ($X_1$), variation in price ($X_2$, in EUR), the advertisement costs in local newspapers ($X_3$, in EUR) and the presence of a sales assistant ($X_4$, in hours per period). Over the periods, he observes the following data matrix:

$$\mathcal{X} = \begin{pmatrix}
230 & 125 & 200 & 109 \\
181 & 99 & 55 & 107 \\
165 & 97 & 105 & 98 \\
150 & 115 & 85 & 71 \\
97 & 120 & 0 & 82 \\
192 & 100 & 150 & 103 \\
181 & 80 & 85 & 111 \\
189 & 90 & 120 & 93 \\
172 & 95 & 110 & 86 \\
170 & 125 & 130 & 78
\end{pmatrix}.$$

He is convinced that the price must have a large influence on the number of pullovers sold. So he makes a scatterplot of $X_2$ vs. $X_1$, see Fig. 3.2.

Fig. 3.2 Scatterplot of variables $X_2$ vs. $X_1$ of the pullovers data set (plot title: Pullovers Data; horizontal axis: Price ($X_2$)) Q MVAscapull1

A rough impression is that the cloud is somewhat downward-sloping. A computation of the empirical covariance yields

$$s_{X_1 X_2} = \frac{1}{10} \sum_{i=1}^{10} (X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2) = -80.02,$$

a negative value as expected.
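The covariance of Example 3.2 can be recomputed from the first two columns of the data matrix. The sketch below is a plain-Python illustration, not the book's quantlet; the function name `covariance` and its `unbiased` switch are assumptions introduced here to contrast the factors $1/n$ and $1/(n-1)$.

```python
# First two columns of the pullover data matrix of Example 3.2.
sold = [230, 181, 165, 150, 97, 192, 181, 189, 172, 170]     # X1: pullovers sold
price = [125, 99, 97, 115, 120, 100, 80, 90, 95, 125]        # X2: price in EUR

def covariance(x, y, unbiased=False):
    """Empirical covariance (3.2); unbiased=True uses the factor 1/(n - 1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cross / (n - 1 if unbiased else n)

print(round(covariance(sold, price), 2))                 # factor 1/n
print(round(covariance(sold, price, unbiased=True), 2))  # factor 1/(n - 1)
```

With only ten periods the two factors give visibly different values, which is the small-sample effect mentioned after (3.3).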

Note: The covariance function is scale dependent. Thus, if the prices in this example were in Japanese Yen (JPY), we would obtain a different answer (see Exercise 3.16). A measure of (linear) dependence independent of the scale is the correlation, which we introduce in the next section.


Summary 

↪ The covariance is a measure of dependence.

↪ Covariance measures only linear dependence.

↪ Covariance is scale dependent.

↪ There are non-linear dependencies that have zero covariance.

↪ Zero covariance does not imply independence.

↪ Independence implies zero covariance.

↪ Negative covariance corresponds to downward-sloping scatterplots.

↪ Positive covariance corresponds to upward-sloping scatterplots.

↪ The covariance of a variable with itself is its variance $\operatorname{Cov}(X, X) = \sigma_{XX} = \sigma_X^2$.

↪ For small $n$, we should replace the factor $\frac{1}{n}$ in the computation of the covariance by $\frac{1}{n-1}$.


3.2  Correlation 

The  correlation  between  two  variables  X  and  Y  is  defined  from  the  covariance  as 
the  following: 


_  Cover,  Y) 

PXY  sjVax(X)  Var(F) 

The  advantage  of  the  correlation  is  that  it  is  independent  of  the  scale,  i.e.  changing 
the  variables’  scale  of  measurement  does  not  change  the  value  of  the  correlation. 
Therefore,  the  correlation  is  more  useful  as  a  measure  of  association  between  two 
random variables than the covariance. The empirical version of $\rho_{XY}$ is as follows:

$$r_{XY} = \frac{s_{XY}}{\sqrt{s_{XX}\, s_{YY}}}. \tag{3.8}$$


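The correlation is in absolute value always less than or equal to 1. It is zero if the covariance
is zero and vice versa. For $p$-dimensional vectors $(X_1, \ldots, X_p)^\top$ we have the
theoretical correlation matrix

$$\mathcal{P} = \begin{pmatrix} \rho_{X_1 X_1} & \cdots & \rho_{X_1 X_p} \\ \vdots & \ddots & \vdots \\ \rho_{X_p X_1} & \cdots & \rho_{X_p X_p} \end{pmatrix},$$

and its empirical version, the empirical correlation matrix which can be calculated
from the observations,

$$\mathcal{R} = \begin{pmatrix} r_{X_1 X_1} & \cdots & r_{X_1 X_p} \\ \vdots & \ddots & \vdots \\ r_{X_p X_1} & \cdots & r_{X_p X_p} \end{pmatrix}.$$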
Example 3.3 We obtain the following correlation matrix for the genuine bank
notes:

$$\mathcal{R}_g = \begin{pmatrix}
1.00 & 0.41 & 0.41 & 0.22 & 0.05 & 0.03 \\
0.41 & 1.00 & 0.66 & 0.24 & 0.20 & -0.25 \\
0.41 & 0.66 & 1.00 & 0.25 & 0.13 & -0.14 \\
0.22 & 0.24 & 0.25 & 1.00 & -0.63 & -0.00 \\
0.05 & 0.20 & 0.13 & -0.63 & 1.00 & -0.25 \\
0.03 & -0.25 & -0.14 & -0.00 & -0.25 & 1.00
\end{pmatrix} \tag{3.9}$$

and for the counterfeit bank notes:

$$\mathcal{R}_f = \begin{pmatrix}
1.00 & 0.35 & 0.24 & -0.25 & 0.08 & 0.06 \\
0.35 & 1.00 & 0.61 & -0.08 & -0.07 & -0.03 \\
0.24 & 0.61 & 1.00 & -0.05 & 0.00 & 0.20 \\
-0.25 & -0.08 & -0.05 & 1.00 & -0.68 & 0.37 \\
0.08 & -0.07 & 0.00 & -0.68 & 1.00 & -0.06 \\
0.06 & -0.03 & 0.20 & 0.37 & -0.06 & 1.00
\end{pmatrix} \tag{3.10}$$



As  noted  before  for  Cov(X4,  X5),  the  correlation  between  X4  (distance  of  the  frame 
to  the  lower  border)  and  X5  (distance  of  the  frame  to  the  upper  border)  is  negative. 
This  is  natural,  since  the  covariance  and  correlation  always  have  the  same  sign  (see 
also  Exercise  3.17). 

Why is the correlation an interesting statistic to study? It is related to independence
of random variables, which we shall define more formally later on. For the
moment we may think of independence as the fact that one variable has no influence
on another.


Theorem 3.1 If $X$ and $Y$ are independent, then $\rho(X, Y) = \mathrm{Cov}(X, Y) = 0$.


In  general,  the  converse  is  not  true,  as  the  following  example  shows. 


Example 3.4 Consider a standard normally-distributed random variable $X$ and a
random variable $Y = X^2$, which is surely not independent of $X$. Here we have

$$\mathrm{Cov}(X, Y) = E(XY) - E(X)\,E(Y) = E(X^3) = 0$$

(because $E(X) = 0$ and $E(X^2) = 1$). Therefore $\rho(X, Y) = 0$, as well. This example
also shows that correlations and covariances measure only linear dependence. The
quadratic dependence of $Y = X^2$ on $X$ is not reflected by these measures of
dependence.
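The point of Example 3.4 can be checked exactly on a small discrete analogue. The sketch below is only an illustration: it replaces the standard normal $X$ by a hypothetical uniform distribution on five symmetric points (an assumption made so that the arithmetic is exact), and the helper name `cov` is introduced here and is not part of the text.

```python
from fractions import Fraction

# Hypothetical discrete stand-in for a symmetric X: uniform on {-2, ..., 2}.
xs = [Fraction(v) for v in (-2, -1, 0, 1, 2)]
ys = [x * x for x in xs]  # Y = X^2 is a deterministic function of X


def cov(a, b):
    """Population covariance E(AB) - E(A)E(B) under equal weights."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum(u * v for u, v in zip(a, b)) / n - mean_a * mean_b


cov_xy = cov(xs, ys)  # reduces to E(X^3) because E(X) = 0
var_x = cov(xs, xs)   # Cov(X, X) is the variance of X
print(cov_xy, var_x)
```

Although $Y$ is fully determined by $X$, the covariance vanishes, which is the sense in which covariance and correlation capture only linear dependence.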


Remark  3.1  For  two  normal  random  variables,  the  converse  of  Theorem  3.1 
is  true:  zero  covariance  for  two  normally  distributed  random  variables  implies 
independence.  This  will  be  shown  later  in  Corollary  5.2. 

Theorem  3.1  enables  us  to  check  for  independence  between  the  components  of 
a  bivariate  normal  random  variable.  That  is,  we  can  use  the  correlation  and  test 
whether  it  is  zero.  The  distribution  of  rXY  for  an  arbitrary  ( X ,  Y)  is  unfortunately 
complicated.  The  distribution  of  rXY  will  be  more  accessible  if  ( X ,  Y )  are  jointly 
normal (see Chap. 5). If we transform the correlation by Fisher's Z-transformation,

$$W = \frac{1}{2} \log\left(\frac{1 + r_{XY}}{1 - r_{XY}}\right), \tag{3.11}$$

we obtain a variable that has a more accessible distribution. Under the hypothesis
that $\rho = 0$, $W$ has an asymptotic normal distribution. Approximations of the
expectation and variance of $W$ are given by the following:

$$E(W) \approx \frac{1}{2} \log\left(\frac{1 + \rho_{XY}}{1 - \rho_{XY}}\right), \qquad \mathrm{Var}(W) \approx \frac{1}{n - 3}. \tag{3.12}$$

The distribution is given in Theorem 3.2.

Theorem 3.2

$$Z = \frac{W - E(W)}{\sqrt{\mathrm{Var}(W)}} \xrightarrow{\mathcal{L}} N(0, 1). \tag{3.13}$$

The symbol "$\xrightarrow{\mathcal{L}}$" denotes convergence in distribution, which will be explained
in more detail in Chap. 4.

Theorem  3.2  allows  us  to  test  different  hypotheses  on  correlation.  We  can  fix  the 
level  of  significance  a  (the  probability  of  rejecting  a  true  hypothesis)  and  reject  the 
hypothesis  if  the  difference  between  the  hypothetical  value  and  the  calculated  value 
of  Z  is  greater  than  the  corresponding  critical  value  of  the  normal  distribution.  The 
following  example  illustrates  the  procedure. 


Example 3.5 Let's study the correlation between mileage ($X_2$) and weight ($X_8$) for
the car data set (22.3) where $n = 74$. We have $r_{X_2 X_8} = -0.823$. Our conclusions
from the boxplot in Fig. 1.3 ("Japanese cars generally have better mileage than the
others") need to be revised. From Fig. 3.3 and $r_{X_2 X_8}$, we can see that mileage is
highly correlated with weight, and that the Japanese cars in the sample are in fact
all lighter than the others.

Fig. 3.3 Mileage ($X_2$) vs. weight ($X_8$) of US (star), European (plus signs) and Japanese (circle) cars. Q MVAscacar
[Figure: "Car Data"; axis label: Mileage (X2)]

If we want to know whether $\rho_{X_2 X_8}$ is significantly different from $\rho_0 = 0$, we
apply Fisher's Z-transform (3.11). This gives us

$$w = \frac{1}{2} \log\left(\frac{1 + r_{X_2 X_8}}{1 - r_{X_2 X_8}}\right) = -1.166 \quad \text{and} \quad z = \frac{-1.166 - 0}{\sqrt{1/71}} = -9.825,$$

i.e. a highly significant value to reject the hypothesis that $\rho = 0$ (the 2.5 % and
97.5 % quantiles of the normal distribution are $-1.96$ and $1.96$, respectively). If we
want to test the hypothesis that, say, $\rho_0 = -0.75$, we obtain:

$$z = \frac{-1.166 - (-0.973)}{\sqrt{1/71}} = -1.627.$$

This is a non-significant value at the $\alpha = 0.05$ level for $z$ since it is between the
critical values at the 5 % significance level (i.e. $-1.96 < z < 1.96$).
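The two test statistics of Example 3.5 can be retraced numerically. The following is a minimal sketch that assumes the approximations (3.12) for $E(W)$ and $\mathrm{Var}(W)$; the function name `fisher_z_statistic` and its parameter names are hypothetical and chosen only for this illustration.

```python
import math


def fisher_z_statistic(r, n, rho0=0.0):
    """Standardised Fisher statistic (w - E(W)) / sqrt(Var(W)), cf. (3.11)-(3.13)."""
    w = 0.5 * math.log((1 + r) / (1 - r))          # Fisher's Z-transform (3.11)
    e_w = 0.5 * math.log((1 + rho0) / (1 - rho0))  # approximate E(W) under rho0
    var_w = 1.0 / (n - 3)                          # approximate Var(W)
    return (w - e_w) / math.sqrt(var_w)


# Values taken from Example 3.5: r = -0.823, n = 74.
z_zero = fisher_z_statistic(-0.823, 74)             # hypothesis rho = 0
z_alt = fisher_z_statistic(-0.823, 74, rho0=-0.75)  # hypothesis rho = -0.75
print(round(z_zero, 3), round(z_alt, 3))
```

Comparing the absolute values with 1.96 gives the same decisions as in the example: rejection for $\rho_0 = 0$ and no rejection for $\rho_0 = -0.75$.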

Example 3.6 Let us consider again the pullovers data set from Example 3.2.
Consider the correlation between the presence of the sales assistants ($X_4$) vs. the
number of sold pullovers ($X_1$) (see Fig. 3.4). Here we compute the correlation as

$$r_{X_1 X_4} = 0.633.$$

The Z-transform of this value is

$$w = \frac{1}{2} \log\left(\frac{1 + r_{X_1 X_4}}{1 - r_{X_1 X_4}}\right) = 0.746. \tag{3.14}$$

The sample size is $n = 10$, so for the hypothesis $\rho_{X_1 X_4} = 0$, the statistic to
consider is:

$$z = \sqrt{7}\,(0.746 - 0) = 1.974 \tag{3.15}$$

which is just statistically significant at the 5 % level (i.e. 1.974 is just a little larger
than 1.96).
Fig. 3.4 Hours of sales assistants ($X_4$) vs. sales ($X_1$) of pullovers. Q MVAscapull2
[Figure: "Pullovers Data"; axis label: Sales Assistants (X4)]

Remark 3.2 The normalising and variance stabilising properties of $W$ are asymptotic.
In addition the use of $W$ in small samples (for $n \le 25$) is improved by
Hotelling's transform (Hotelling, 1953):

$$W^* = W - \frac{3W + \tanh(W)}{4(n - 1)} \quad \text{with} \quad \mathrm{Var}(W^*) = \frac{1}{n - 1}.$$

The transformed variable $W^*$ is asymptotically distributed as a normal distribution.

Example 3.7 From the preceding remark, we obtain $w^* = 0.6663$ and
$\sqrt{10 - 1}\, w^* = 1.9989$ for the preceding Example 3.6. This value is significant
at the 5 % level.
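The small-sample correction of Remark 3.2 is easy to retrace for the numbers of Example 3.6. The sketch below assumes the form of Hotelling's transform stated in the remark; the function name `hotelling_w_star` is hypothetical. Because it works with the unrounded value of $w$, the result can differ from the printed figures in the last digits.

```python
import math


def hotelling_w_star(r, n):
    """Hotelling's transform W* = W - (3W + tanh(W)) / (4(n - 1))."""
    w = math.atanh(r)  # Fisher's Z-transform of r
    return w - (3 * w + math.tanh(w)) / (4 * (n - 1))


# Values from Example 3.6: r = 0.633, n = 10.
w_star = hotelling_w_star(0.633, 10)
z_star = math.sqrt(10 - 1) * w_star  # standardise with Var(W*) = 1 / (n - 1)
print(round(w_star, 4), round(z_star, 4))
```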

Remark 3.3 Note that the Fisher's Z-transform is the inverse of the hyperbolic
tangent function: $W = \tanh^{-1}(r_{XY})$; equivalently $r_{XY} = \tanh(W) = \frac{e^{2W} - 1}{e^{2W} + 1}$.

Remark 3.4 Under the assumptions of normality of $X$ and $Y$, we may test their
independence ($\rho_{XY} = 0$) using the exact $t$-distribution of the statistic

$$T = r_{XY} \sqrt{\frac{n - 2}{1 - r_{XY}^2}} \;\overset{\rho_{XY} = 0}{\sim}\; t_{n-2}.$$

Setting the probability of the first error type to $\alpha$, we reject the null hypothesis
$\rho_{XY} = 0$ if $|T| \ge t_{1 - \alpha/2;\, n-2}$.
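As an illustration of Remark 3.4, the statistic $T$ can be evaluated for the pullover correlation of Example 3.6. This is a sketch only: the function name `t_statistic` is hypothetical, and the critical value mentioned afterwards is quoted from standard $t$-tables rather than from the text.

```python
import math


def t_statistic(r, n):
    """T = r * sqrt((n - 2) / (1 - r^2)), t-distributed with n - 2 df under rho = 0."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Values from Example 3.6: r = 0.633, n = 10, hence 8 degrees of freedom.
t_val = t_statistic(0.633, 10)
print(round(t_val, 3))
```

With 8 degrees of freedom the tabulated 97.5 % quantile is roughly 2.306, so under the normality assumption the exact test leads to the same borderline rejection at the 5 % level as the Fisher and Hotelling approximations.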


Summary

The correlation is a standardised measure of dependence.

The absolute value of the correlation is always less than or equal to one.

Correlation measures only linear dependence.

There are non-linear dependencies that have zero correlation.

Zero correlation does not imply independence. For two normal
random variables, it does.

Independence implies zero correlation.

Negative correlation corresponds to downward-sloping scatterplots.

Positive correlation corresponds to upward-sloping scatterplots.

Fisher's Z-transform helps us in testing hypotheses on correlation.

For small samples, Fisher's Z-transform can be improved by the
transformation $W^* = W - \frac{3W + \tanh(W)}{4(n - 1)}$.


3.3  Summary  Statistics 


This  section  focuses  on  the  representation  of  basic  summary  statistics  (means, 
covariances  and  correlations)  in  matrix  notation,  since  we  often  apply  linear 
transformations  to  data.  The  matrix  notation  allows  us  to  derive  instantaneously 
the  corresponding  characteristics  of  the  transformed  variables.  The  Mahalanobis 
transformation  is  a  prominent  example  of  such  linear  transformations. 

Assume that we have observed $n$ realisations of a $p$-dimensional random
variable; we have a data matrix $\mathcal{X}(n \times p)$:

$$\mathcal{X} = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix}. \tag{3.16}$$

The rows $x_i = (x_{i1}, \ldots, x_{ip}) \in \mathbb{R}^p$ denote the $i$th observation of a $p$-dimensional
random variable $X \in \mathbb{R}^p$.


The statistics that were briefly introduced in Sects. 3.1 and 3.2 can be rewritten in
matrix form as follows. The "centre of gravity" of the $n$ observations in $\mathbb{R}^p$ is given
by the vector $\bar{x}$ of the means $\bar{x}_j$ of the $p$ variables:

$$\bar{x} = \begin{pmatrix} \bar{x}_1 \\ \vdots \\ \bar{x}_p \end{pmatrix} = n^{-1} \mathcal{X}^\top 1_n. \tag{3.17}$$


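The dispersion of the $n$ observations can be characterised by the covariance
matrix of the $p$ variables. The empirical covariances defined in (3.2) and (3.3) are
the elements of the following matrix:

$$\mathcal{S} = n^{-1} \mathcal{X}^\top \mathcal{X} - \bar{x}\,\bar{x}^\top = n^{-1}\left(\mathcal{X}^\top \mathcal{X} - n^{-1} \mathcal{X}^\top 1_n 1_n^\top \mathcal{X}\right). \tag{3.18}$$

Note that this matrix is equivalently defined by

$$\mathcal{S} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^\top.$$

The covariance formula (3.18) can be rewritten as $\mathcal{S} = n^{-1} \mathcal{X}^\top \mathcal{H} \mathcal{X}$ with the
centering matrix

$$\mathcal{H} = \mathcal{I}_n - n^{-1} 1_n 1_n^\top. \tag{3.19}$$

Note that the centering matrix is symmetric and idempotent. Indeed,

$$\begin{aligned}
\mathcal{H}^2 &= (\mathcal{I}_n - n^{-1} 1_n 1_n^\top)(\mathcal{I}_n - n^{-1} 1_n 1_n^\top) \\
&= \mathcal{I}_n - n^{-1} 1_n 1_n^\top - n^{-1} 1_n 1_n^\top + (n^{-1} 1_n 1_n^\top)(n^{-1} 1_n 1_n^\top) \\
&= \mathcal{I}_n - n^{-1} 1_n 1_n^\top = \mathcal{H}.
\end{aligned}$$

As a consequence $\mathcal{S}$ is positive semidefinite, i.e.

$$\mathcal{S} \ge 0. \tag{3.20}$$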


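Indeed for all $a \in \mathbb{R}^p$,

$$\begin{aligned}
a^\top \mathcal{S} a &= n^{-1} a^\top \mathcal{X}^\top \mathcal{H} \mathcal{X} a \\
&= n^{-1} (a^\top \mathcal{X}^\top \mathcal{H}^\top)(\mathcal{H} \mathcal{X} a) \quad \text{since } \mathcal{H}^\top \mathcal{H} = \mathcal{H}, \\
&= n^{-1} y^\top y = n^{-1} \sum_{j=1}^{n} y_j^2 \ge 0
\end{aligned}$$

for $y = \mathcal{H} \mathcal{X} a$. It is well known from the one-dimensional case that $n^{-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
as an estimate of the variance exhibits a bias of the order $n^{-1}$ (Breiman, 1973).
In the multi-dimensional case, $\mathcal{S}_u = \frac{n}{n-1} \mathcal{S}$ is an unbiased estimate of the true
covariance. (This will be shown in Example 4.15.)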

The sample correlation coefficient between the $i$th and $j$th variables is $r_{X_i X_j}$, see
(3.8). If $\mathcal{D} = \mathrm{diag}(s_{X_i X_i})$, then the correlation matrix is

$$\mathcal{R} = \mathcal{D}^{-1/2} \mathcal{S} \mathcal{D}^{-1/2}, \tag{3.21}$$

where $\mathcal{D}^{-1/2}$ is a diagonal matrix with elements $(s_{X_i X_i})^{-1/2}$ on its main diagonal.

Example 3.8 The empirical covariances are calculated for the pullover data set.

The vector of the means of the four variables in the dataset is
$\bar{x} = (172.7, 104.6, 104.0, 93.8)^\top$.

The sample covariance matrix is

$$\mathcal{S} = \begin{pmatrix}
1037.2 & -80.2 & 1430.7 & 271.4 \\
-80.2 & 219.8 & 92.1 & -91.6 \\
1430.7 & 92.1 & 2624 & 210.3 \\
271.4 & -91.6 & 210.3 & 177.4
\end{pmatrix}.$$

The unbiased estimate of the variance ($n = 10$) is equal to

$$\mathcal{S}_u = \frac{10}{9} \mathcal{S} = \begin{pmatrix}
1152.5 & -88.9 & 1589.7 & 301.6 \\
-88.9 & 244.3 & 102.3 & -101.8 \\
1589.7 & 102.3 & 2915.6 & 233.7 \\
301.6 & -101.8 & 233.7 & 197.1
\end{pmatrix}.$$

The sample correlation matrix is

$$\mathcal{R} = \begin{pmatrix}
1 & -0.17 & 0.87 & 0.63 \\
-0.17 & 1 & 0.12 & -0.46 \\
0.87 & 0.12 & 1 & 0.31 \\
0.63 & -0.46 & 0.31 & 1
\end{pmatrix}.$$
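The relations $\mathcal{S}_u = \frac{n}{n-1}\mathcal{S}$ and (3.21) can be retraced from the printed covariance matrix of Example 3.8. The sketch below uses plain Python lists instead of a matrix library; the variable names are chosen for this illustration only, and because the input is the rounded matrix $\mathcal{S}$, the last printed digit of some entries may differ.

```python
import math

n = 10
# Sample covariance matrix S as printed in Example 3.8.
S = [
    [1037.2, -80.2, 1430.7, 271.4],
    [-80.2, 219.8, 92.1, -91.6],
    [1430.7, 92.1, 2624.0, 210.3],
    [271.4, -91.6, 210.3, 177.4],
]
p = len(S)

# Unbiased estimate S_u = n / (n - 1) * S.
S_u = [[n / (n - 1) * S[i][j] for j in range(p)] for i in range(p)]

# Correlation matrix R = D^{-1/2} S D^{-1/2} with D = diag(s_ii), cf. (3.21).
d = [1.0 / math.sqrt(S[i][i]) for i in range(p)]
R = [[d[i] * S[i][j] * d[j] for j in range(p)] for i in range(p)]

for row in R:
    print([round(v, 2) for v in row])
```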

Linear  Transformation 

In  many  practical  applications  we  need  to  study  linear  transformations  of  the 
original  data.  This  motivates  the  question  of  how  to  calculate  summary  statistics 
after  such  linear  transformations. 

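Let $\mathcal{A}$ be a $(q \times p)$ matrix and consider the transformed data matrix

$$\mathcal{Y} = \mathcal{X} \mathcal{A}^\top = (y_1, \ldots, y_n)^\top. \tag{3.22}$$

The row $y_i = (y_{i1}, \ldots, y_{iq}) \in \mathbb{R}^q$ can be viewed as the $i$th observation
of a $q$-dimensional random variable $Y = \mathcal{A} X$. In fact we have $y_i = x_i \mathcal{A}^\top$.
We immediately obtain the mean and the empirical covariance of the variables
(columns) forming the data matrix $\mathcal{Y}$:

$$\bar{y} = \frac{1}{n} \mathcal{Y}^\top 1_n = \frac{1}{n} \mathcal{A} \mathcal{X}^\top 1_n = \mathcal{A} \bar{x} \tag{3.23}$$

$$\mathcal{S}_{\mathcal{Y}} = \frac{1}{n} \mathcal{Y}^\top \mathcal{H} \mathcal{Y} = \frac{1}{n} \mathcal{A} \mathcal{X}^\top \mathcal{H} \mathcal{X} \mathcal{A}^\top = \mathcal{A} \mathcal{S}_{\mathcal{X}} \mathcal{A}^\top. \tag{3.24}$$

Note that if the linear transformation is non-homogeneous, i.e.

$$y_i = \mathcal{A} x_i + b \quad \text{where } b\,(q \times 1),$$

only (3.23) changes: $\bar{y} = \mathcal{A} \bar{x} + b$. The formulas (3.23) and (3.24) are useful in the
particular case of $q = 1$, i.e. $y = \mathcal{X} a$, i.e. $y_i = a^\top x_i$; $i = 1, \ldots, n$:

$$\bar{y} = a^\top \bar{x}, \qquad \mathcal{S}_y = a^\top \mathcal{S}_{\mathcal{X}}\, a.$$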


Example 3.9 Suppose that $\mathcal{X}$ is the pullover data set. The manager wants to
compute his mean expenses for advertisement ($X_3$) and sales assistant ($X_4$).

Suppose that the sales assistant charges an hourly wage of 10 EUR. Then the
shop manager calculates the expenses $Y$ as $Y = X_3 + 10 X_4$. Formula (3.22) says
that this is equivalent to defining the matrix $\mathcal{A}(1 \times 4)$ as:

$$\mathcal{A} = (0, 0, 1, 10).$$

Using formulas (3.23) and (3.24), it is now computationally very easy to obtain the
sample mean $\bar{y}$ and the sample variance $\mathcal{S}_y$ of the overall expenses:

$$\bar{y} = \mathcal{A} \bar{x} = (0, 0, 1, 10) \begin{pmatrix} 172.7 \\ 104.6 \\ 104.0 \\ 93.8 \end{pmatrix} = 1042.0$$

$$\mathcal{S}_y = \mathcal{A} \mathcal{S}_{\mathcal{X}} \mathcal{A}^\top = (0, 0, 1, 10) \begin{pmatrix}
1152.5 & -88.9 & 1589.7 & 301.6 \\
-88.9 & 244.3 & 102.3 & -101.8 \\
1589.7 & 102.3 & 2915.6 & 233.7 \\
301.6 & -101.8 & 233.7 & 197.1
\end{pmatrix} \begin{pmatrix} 0 \\ 0 \\ 1 \\ 10 \end{pmatrix}
= 2915.6 + 4674 + 19710 = 27299.6.$$
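The arithmetic of Example 3.9 can be checked with a few lines of code. This is a sketch with plain lists; the variable names (`a`, `x_bar`, `S_x`) are illustrative, and the matrix is the one printed in the example.

```python
# Weights of the linear combination Y = X3 + 10 * X4.
a = [0.0, 0.0, 1.0, 10.0]

# Mean vector and covariance matrix as printed in Examples 3.8 and 3.9.
x_bar = [172.7, 104.6, 104.0, 93.8]
S_x = [
    [1152.5, -88.9, 1589.7, 301.6],
    [-88.9, 244.3, 102.3, -101.8],
    [1589.7, 102.3, 2915.6, 233.7],
    [301.6, -101.8, 233.7, 197.1],
]

# Mean of the transformed variable: a^T x_bar.
y_bar = sum(ai * xi for ai, xi in zip(a, x_bar))

# Variance of the transformed variable: a^T S_x a.
S_a = [sum(S_x[i][j] * a[j] for j in range(4)) for i in range(4)]
s_y = sum(a[i] * S_a[i] for i in range(4))

print(round(y_bar, 1), round(s_y, 1))
```

Only the entries of $\mathcal{S}_{\mathcal{X}}$ belonging to $X_3$ and $X_4$ contribute, which is why the result reduces to the three terms shown in the example.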


Mahalanobis Transformation

A special case of this linear transformation is

$$z_i = \mathcal{S}^{-1/2}(x_i - \bar{x}), \quad i = 1, \ldots, n. \tag{3.25}$$

Note that for the transformed data matrix $\mathcal{Z} = (z_1, \ldots, z_n)^\top$,

$$\mathcal{S}_{\mathcal{Z}} = n^{-1} \mathcal{Z}^\top \mathcal{H} \mathcal{Z} = \mathcal{I}_p. \tag{3.26}$$

So the Mahalanobis transformation eliminates the correlation between the variables
and standardises the variance of each variable. If we apply (3.24) using $\mathcal{A} = \mathcal{S}^{-1/2}$,
we obtain the identity covariance matrix as indicated in (3.26).


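Summary

The centre of gravity of a data matrix is given by its mean vector
$\bar{x} = n^{-1} \mathcal{X}^\top 1_n$.

The dispersion of the observations in a data matrix is given by the
empirical covariance matrix $\mathcal{S} = n^{-1} \mathcal{X}^\top \mathcal{H} \mathcal{X}$.

The empirical correlation matrix is given by $\mathcal{R} = \mathcal{D}^{-1/2} \mathcal{S} \mathcal{D}^{-1/2}$.

A linear transformation $\mathcal{Y} = \mathcal{X} \mathcal{A}^\top$ of a data matrix $\mathcal{X}$ has mean
$\mathcal{A} \bar{x}$ and empirical covariance $\mathcal{A} \mathcal{S}_{\mathcal{X}} \mathcal{A}^\top$.

The Mahalanobis transformation is a linear transformation
$z_i = \mathcal{S}^{-1/2}(x_i - \bar{x})$ which gives a standardised, uncorrelated data
matrix $\mathcal{Z}$.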


3.4  Linear  Model  for  Two  Variables 

We have looked several times now at downward- and upward-sloping scatterplots.
What does the eye define here as a slope? Suppose that we can construct a line
corresponding to the general direction of the cloud. The sign of the slope of this
line would correspond to the upward and downward directions. Call the variable on
the vertical axis $Y$ and the one on the horizontal axis $X$. A slope line is a linear
relationship between $X$ and $Y$:

$$y_i = \alpha + \beta x_i + \varepsilon_i, \quad i = 1, \ldots, n. \tag{3.27}$$

Here, $\alpha$ is the intercept and $\beta$ is the slope of the line. The errors (or deviations from
the line) are denoted as $\varepsilon_i$ and are assumed to have zero mean and finite variance
$\sigma^2$. The task of finding $(\alpha, \beta)$ in (3.27) is referred to as a linear adjustment.

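In Sect. 3.6 we shall derive estimators for $\alpha$ and $\beta$ more formally, as well as
accurately describe what a "good" estimator is. For now, one may try to find a
"good" estimator $(\hat{\alpha}, \hat{\beta})$ via graphical techniques. A very common numerical and
statistical technique is to use those $\hat{\alpha}$ and $\hat{\beta}$ that minimise:

$$(\hat{\alpha}, \hat{\beta}) = \arg\min_{(\alpha, \beta)} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2. \tag{3.28}$$

The solutions to this task are the estimators:

$$\hat{\beta} = \frac{s_{XY}}{s_{XX}} \tag{3.29}$$

$$\hat{\alpha} = \bar{y} - \hat{\beta} \bar{x}. \tag{3.30}$$

The variance of $\hat{\beta}$ is:

$$\mathrm{Var}(\hat{\beta}) = \frac{\sigma^2}{n \cdot s_{XX}}. \tag{3.31}$$

The standard error (SE) of the estimator is the square root of (3.31),

$$\mathrm{SE}(\hat{\beta}) = \{\mathrm{Var}(\hat{\beta})\}^{1/2} = \frac{\sigma}{(n \cdot s_{XX})^{1/2}}. \tag{3.32}$$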

We can use this formula to test the hypothesis that $\beta = 0$. In an application the
variance $\sigma^2$ has to be estimated by an estimator $\hat{\sigma}^2$ that will be given below. Under
a normality assumption of the errors, the $t$-test for the hypothesis $\beta = 0$ works as
follows.

One computes the statistic

$$t = \frac{\hat{\beta}}{\mathrm{SE}(\hat{\beta})} \tag{3.33}$$

and rejects the hypothesis at a 5 % significance level if $|t| \ge t_{0.975;\, n-2}$, where the
97.5 % quantile of the Student's $t_{n-2}$ distribution is clearly the 95 % critical value
for the two-sided test. For $n \ge 30$, this can be replaced by 1.96, the 97.5 % quantile
of the normal distribution. An estimator $\hat{\sigma}^2$ of $\sigma^2$ will be given in the following.
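The estimators (3.29) and (3.30) and the $t$-statistic (3.33) translate directly into code. The following sketch uses a small hypothetical data set (the six data points are invented for illustration and are not the pullover data) and estimates $\sigma^2$ by $\mathrm{RSS}/(n-2)$, the estimator introduced later in this section.

```python
import math

# Hypothetical toy data, not taken from the book.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Empirical (co)variances with the factor 1/n, as in the text.
s_xx = sum((xi - x_bar) ** 2 for xi in x) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n

beta_hat = s_xy / s_xx                 # slope, (3.29)
alpha_hat = y_bar - beta_hat * x_bar   # intercept, (3.30)

# Residuals and the estimate of sigma^2 based on RSS / (n - 2).
residuals = [yi - alpha_hat - beta_hat * xi for xi, yi in zip(x, y)]
rss = sum(e ** 2 for e in residuals)
sigma2_hat = rss / (n - 2)

se_beta = math.sqrt(sigma2_hat / (n * s_xx))  # (3.32) with sigma estimated
t_stat = beta_hat / se_beta                   # (3.33)
print(alpha_hat, beta_hat, se_beta, t_stat)
```

The value of `t_stat` would then be compared with the quantile $t_{0.975;\,n-2}$ taken from a table or a statistics library.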

Example 3.10 Let us apply the linear regression model (3.27) to the "classic blue"
pullovers. The sales manager believes that the number of sales depends strongly
on price. He computes the regression line as shown in Fig. 3.5.

Fig. 3.5 Regression of sales ($X_1$) on price ($X_2$) of pullovers. Q MVAregpull
[Figure: "Pullovers Data"]

How good is this fit? This can be judged via goodness-of-fit measures. Define

$$\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i, \tag{3.34}$$

as the predicted value of $y$ as a function of $x$. With $\hat{y}$ the textile shop manager in
the above example can predict sales as a function of prices $x$. The variation in the
response variable is:

$$n s_{YY} = \sum_{i=1}^{n} (y_i - \bar{y})^2. \tag{3.35}$$

The variation explained by the linear regression (3.27) with the predicted values
(3.34) is:

$$\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2. \tag{3.36}$$

The residual sum of squares, the minimum in (3.28), is given by:

$$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2. \tag{3.37}$$
An unbiased estimator $\hat{\sigma}^2$ of $\sigma^2$ is given by $\mathrm{RSS}/(n - 2)$.

The following relation holds between (3.35) and (3.37):

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \tag{3.38}$$

total variation = explained variation + unexplained variation.

The coefficient of determination is $r^2$:

$$r^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = \frac{\text{explained variation}}{\text{total variation}}. \tag{3.39}$$

The coefficient of determination increases with the proportion of explained variation
by the linear relation (3.27). In the extreme case where $r^2 = 1$, all of the variation
is explained by the linear regression (3.27). The other extreme, $r^2 = 0$, is where
the empirical covariance is $s_{XY} = 0$. The coefficient of determination can be
rewritten as

$$r^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}. \tag{3.40}$$

From (3.39), it can be seen that in the linear regression (3.27), $r^2 = r_{XY}^2$ is the
square of the correlation between $X$ and $Y$.
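The decomposition (3.38) and the identity $r^2 = r_{XY}^2$ can be verified numerically. The sketch below again uses hypothetical toy data (invented for illustration, not the pullover data); all variable names are chosen for this example.

```python
import math

# Hypothetical toy data, not taken from the book.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x) / n
s_yy = sum((yi - y_bar) ** 2 for yi in y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n

# Least squares fit and predicted values (3.34).
beta_hat = s_xy / s_xx
alpha_hat = y_bar - beta_hat * x_bar
y_hat = [alpha_hat + beta_hat * xi for xi in x]

total = sum((yi - y_bar) ** 2 for yi in y)                # (3.35)
explained = sum((yh - y_bar) ** 2 for yh in y_hat)        # (3.36)
rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))     # (3.37)

r2 = explained / total                                    # (3.39)
r_xy = s_xy / math.sqrt(s_xx * s_yy)                      # (3.8)
print(total, explained + rss, r2, r_xy ** 2)
```

Up to floating-point rounding, the first two printed numbers agree (total = explained + unexplained variation), and so do the last two ($r^2 = r_{XY}^2$).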

Example 3.11 For the above pullover example, we estimate

$$\hat{\alpha} = 210.774 \quad \text{and} \quad \hat{\beta} = -0.364.$$

The coefficient of determination is

$$r^2 = 0.028.$$

The  textile  shop  manager  concludes  that  sales  are  not  influenced  very  much  by  the 
price  (in  a  linear  way). 

The geometrical representation of formula (3.38) can be graphically evaluated
using Fig. 3.6. This plot shows a section of the linear regression of the "sales"
on "price" for the pullovers data. The distance between any point and the overall
mean is given by the distance between the point and the regression line and the
distance between the regression line and the mean. The sums of these two distances
represent the total variance (solid blue lines from the observations to the overall
mean), i.e. the explained variance (distance from the regression curve to the mean)
and the unexplained variance (distance from the observation to the regression line),
respectively.

Fig. 3.6 Regression of sales ($X_1$) on price ($X_2$) of pullovers. The overall mean is given by the dashed line. Q MVAregzoom
[Figure: "Pullover Data"]

Fig. 3.7 Regression of $X_5$ (upper inner frame) on $X_4$ (lower inner frame) for genuine bank notes. Q MVAregbank
[Figure: "Swiss bank notes"]

In  general  the  regression  of  Y  on  X  is  different  from  that  of  X  on  Y .  We  will 
demonstrate  this,  once  again,  using  the  Swiss  bank  notes  data. 

Example 3.12 The least squares fit of the variables $X_4$ ($X$) and $X_5$ ($Y$) from
the genuine bank notes is calculated. Figure 3.7 shows the fitted line if $X_5$ is
approximated by a linear function of $X_4$. In this case the parameters are

$$\hat{\alpha} = 15.464 \quad \text{and} \quad \hat{\beta} = -0.638.$$

If we predict $X_4$ by a function of $X_5$ instead, we would arrive at a different
intercept and slope

$$\hat{\alpha} = 14.666 \quad \text{and} \quad \hat{\beta} = -0.626.$$


The linear regression of $Y$ on $X$ is given by minimising (3.28), i.e. the vertical
errors $\varepsilon_i$. The linear regression of $X$ on $Y$ does the same, but here the errors
to be minimised in the least squares sense are measured horizontally. As seen in
Example 3.12, the two least squares lines are different although both measure (in a
certain sense) the slope of the cloud of points.

As  shown  in  the  next  example,  there  is  still  one  other  way  to  measure  the 
main  direction  of  a  cloud  of  points:  it  is  related  to  the  spectral  decomposition  of 
covariance  matrices. 

Example 3.13 Suppose that we have the following covariance matrix:

$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}.$$

Figure 3.8 shows a scatterplot of a sample of two normal random variables with
such a covariance matrix (with $\rho = 0.8$).

Fig. 3.8 Scatterplot for a sample of two correlated normal random variables (sample size $n = 150$, $\rho = 0.8$). Q MVAcorrnorm

The eigenvalues of $\Sigma$ are, as was shown in Example 2.4, solutions to:

$$\begin{vmatrix} 1 - \lambda & \rho \\ \rho & 1 - \lambda \end{vmatrix} = 0.$$

Hence, $\lambda_1 = 1 + \rho$ and $\lambda_2 = 1 - \rho$. Therefore $\Lambda = \mathrm{diag}(1 + \rho, 1 - \rho)$. The
eigenvector corresponding to $\lambda_1 = 1 + \rho$ can be computed from the system of
linear equations:

$$\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = (1 + \rho) \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$$

or

$$\begin{aligned} x_1 + \rho x_2 &= x_1 + \rho x_1 \\ \rho x_1 + x_2 &= x_2 + \rho x_2 \end{aligned}$$

and thus

$$x_1 = x_2.$$

The first (standardised) eigenvector is

$$\gamma_1 = \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}.$$

The direction of this eigenvector is the diagonal in Fig. 3.8 and captures the main
variation in this direction. We shall come back to this interpretation in Chap. 11. The
second eigenvector (orthogonal to $\gamma_1$) is

$$\gamma_2 = \begin{pmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{pmatrix}.$$

So finally

$$\Gamma = (\gamma_1, \gamma_2) = \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \end{pmatrix}$$

and we can check our calculation by

$$\Sigma = \Gamma \Lambda \Gamma^\top.$$

The first eigenvector captures the main direction of a point cloud. The linear
regression of $Y$ on $X$ and $X$ on $Y$ accomplish, in a sense, the same thing. In
general the direction of the eigenvector and the least squares slope are different.
The reason is that the least squares estimator minimises either vertical or horizontal
errors (in (3.28)), whereas the first eigenvector corresponds to a minimisation that
is orthogonal to the eigenvector (see Chap. 11).
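The check $\Sigma = \Gamma \Lambda \Gamma^\top$ from Example 3.13 can be carried out numerically for $\rho = 0.8$. The sketch below uses small hand-written helpers (`matmul` and `transpose` are hypothetical names introduced here) and also contrasts the slope of the first eigenvector with the slope $\mathrm{Cov}(X, Y)/\mathrm{Var}(X)$ of the regression of $Y$ on $X$, the population analogue of (3.29).

```python
import math

rho = 0.8
c = 1.0 / math.sqrt(2.0)

Gamma = [[c, c], [c, -c]]                     # columns are the eigenvectors
Lam = [[1.0 + rho, 0.0], [0.0, 1.0 - rho]]    # eigenvalues on the diagonal


def matmul(A, B):
    """Product of two 2 x 2 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def transpose(A):
    return [[A[j][i] for j in range(2)] for i in range(2)]


# Spectral decomposition: Sigma = Gamma Lambda Gamma^T.
Sigma = matmul(matmul(Gamma, Lam), transpose(Gamma))

eigen_slope = Gamma[1][0] / Gamma[0][0]  # slope of the first eigenvector
ls_slope = Sigma[0][1] / Sigma[0][0]     # Cov(X, Y) / Var(X)
print(Sigma, eigen_slope, ls_slope)
```

For this covariance matrix the eigenvector points along the diagonal (slope 1), while the regression slope equals $\rho$, which illustrates that the two directions differ in general.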
Summary

The linear regression $y = \alpha + \beta x + \varepsilon$ models a linear relation
between two one-dimensional variables.

The sign of the slope $\hat{\beta}$ is the same as that of the covariance and the
correlation of $x$ and $y$.

A linear regression predicts values of $Y$ given a possible observation
$x$ of $X$.

The coefficient of determination $r^2$ measures the amount of variation
in $Y$ which is explained by a linear regression on $X$.

If the coefficient of determination is $r^2 = 1$, then all points lie on
one line.

The regression line of $X$ on $Y$ and the regression line of $Y$ on $X$
are in general different.

The $t$-test for the hypothesis $\beta = 0$ is $t = \frac{\hat{\beta}}{\mathrm{SE}(\hat{\beta})}$, where
$\mathrm{SE}(\hat{\beta}) = \frac{\hat{\sigma}}{(n \cdot s_{XX})^{1/2}}$.

The $t$-test rejects the null hypothesis $\beta = 0$ at the level of
significance $\alpha$ if $|t| \ge t_{1-\alpha/2;\, n-2}$ where $t_{1-\alpha/2;\, n-2}$ is the $1 - \alpha/2$
quantile of the Student's $t$-distribution with $(n - 2)$ degrees of
freedom.

The standard error $\mathrm{SE}(\hat{\beta})$ increases/decreases with less/more
spread in the $X$ variables.

The direction of the first eigenvector of the covariance matrix of
a two-dimensional point cloud is different from the least squares
regression line.


3.5  Simple  Analysis  of  Variance 

In  a  simple  (i.e.  one-factorial)  analysis  of  variance  (ANOVA),  it  is  assumed  that 
the  average  values  of  the  response  variable  y  are  induced  by  one  simple  factor. 
Suppose  that  this  factor  takes  on  p  values  and  that  for  each  factor  level,  we  have 
$m = n/p$ observations. The sample is of the form given in Table 3.1, where all of
the observations are independent.

The goal of a simple ANOVA is to analyse the observation structure

$$y_{kl} = \mu_l + \varepsilon_{kl} \quad \text{for } k = 1, \ldots, m, \text{ and } l = 1, \ldots, p. \tag{3.41}$$

Each factor has a mean value $\mu_l$. Each observation $y_{kl}$ is assumed to be a sum of the
corresponding factor mean value $\mu_l$ and a zero mean random error $\varepsilon_{kl}$. The linear
regression model falls into this scheme with $m = 1$, $p = n$ and $\mu_i = \alpha + \beta x_i$,
where $x_i$ is the $i$th level value of the factor.

Table 3.1 Observation structure of a simple ANOVA

Sample element    Factor levels l
1                 y_11  ...  y_1l  ...  y_1p
2                 ...
...
k                 y_k1  ...  y_kl  ...  y_kp
...
m                 y_m1  ...  y_ml  ...  y_mp

Table 3.2 Pullover sales as function of marketing strategy

Shop k    Marketing strategy (factor l)
            1     2     3
 1          9    10    18
 2         11    15    14
 3         10    11    17
 4         12    15     9
 5          7    15    14
 6         11    13    17
 7         12     7    16
 8         10    15    14
 9         11    13    17
10         13    10    15

Example  3.14  The  “classic  blue”  pullover  company  analyses  the  effect  of  three 
marketing  strategies 

1.  advertisement in local newspaper,

2.  presence  of  sales  assistant, 

3.  luxury  presentation  in  shop  windows. 

All  of  these  strategies  are  tried  in  ten  different  shops.  The  resulting  sale 
observations  are  given  in  Table  3.2. 

There are $p = 3$ factor levels and $n = mp = 30$ observations in the data. The "classic
blue" pullover company wants to know whether all three marketing strategies have
the same mean effect or whether there are differences. Having the same effect
means that all $\mu_l$ in (3.41) equal one value, $\mu$. The hypothesis to be tested is
therefore

$$H_0 : \mu_l = \mu \quad \text{for } l = 1, \ldots, p.$$

The alternative hypothesis, that the marketing strategies have different effects, can
be formulated as

$$H_1 : \mu_l \neq \mu_{l'} \quad \text{for some } l \text{ and } l'.$$

This  means  that  one  marketing  strategy  is  better  than  the  others. 

The method used to test this problem is to compute as in (3.38) the total variation
and to decompose it into the sources of variation. This gives:

$$\sum_{l=1}^{p} \sum_{k=1}^{m} (y_{kl} - \bar{y})^2 = m \sum_{l=1}^{p} (\bar{y}_l - \bar{y})^2 + \sum_{l=1}^{p} \sum_{k=1}^{m} (y_{kl} - \bar{y}_l)^2. \tag{3.42}$$

The total variation (sum of squares = SS) is:

$$\mathrm{SS(reduced)} = \sum_{l=1}^{p} \sum_{k=1}^{m} (y_{kl} - \bar{y})^2 \tag{3.43}$$

where $\bar{y} = n^{-1} \sum_{l=1}^{p} \sum_{k=1}^{m} y_{kl}$ is the overall mean. Here the total variation is
denoted as SS(reduced), since in comparison with the model under the alternative
$H_1$, we have a reduced set of parameters. In fact there is 1 parameter $\mu = \mu_l$
under $H_0$. Under $H_1$, the "full" model, we have three parameters, namely the three
different means $\mu_l$.

The variation under $H_1$ is therefore:

$$\mathrm{SS(full)} = \sum_{l=1}^{p} \sum_{k=1}^{m} (y_{kl} - \bar{y}_l)^2 \tag{3.44}$$

where $\bar{y}_l = m^{-1} \sum_{k=1}^{m} y_{kl}$ is the mean of each factor $l$. The hypothetical model $H_0$
is called reduced, since it has (relative to $H_1$) fewer parameters.

The $F$-test of the linear hypothesis is used to compare the difference in the
variations under the reduced model $H_0$ (3.43) and the full model $H_1$ (3.44) to the
variation under the full model $H_1$:

$$F = \frac{\{\mathrm{SS(reduced)} - \mathrm{SS(full)}\} / \{\mathrm{df}(r) - \mathrm{df}(f)\}}{\mathrm{SS(full)} / \mathrm{df}(f)}. \tag{3.45}$$

Here $\mathrm{df}(f)$ and $\mathrm{df}(r)$ denote the degrees of freedom under the full model and
the reduced model, respectively. The degrees of freedom are essential in specifying
the shape of the $F$-distribution. They have a simple interpretation: $\mathrm{df}(\cdot)$
is equal to the number of observations minus the number of parameters in the
model.
From Example 3.14, $p = 3$ parameters are estimated under the full model, i.e. $df(f) = n - p = 30 - 3 = 27$. Under the reduced model, there is one parameter to estimate, namely the overall mean, i.e. $df(r) = n - 1 = 29$. We can compute

$$SS(\text{reduced}) = 260.3$$

and

$$SS(\text{full}) = 157.7.$$

The F-statistic (3.45) is therefore

$$F = \frac{(260.3 - 157.7)/2}{157.7/27} = 8.78.$$

This value needs to be compared to the quantiles of the $F_{2,27}$ distribution. Looking up the critical values of the F-distribution shows that the test statistic above is highly significant. We conclude that the marketing strategies have different effects.
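The arithmetic of this example can be retraced in a few lines. The following is only a sketch: the function name `f_statistic` is made up for illustration, and the sums of squares are the values quoted in the text, not recomputed from Table 3.1.

```python
def f_statistic(ss_reduced, ss_full, df_reduced, df_full):
    """F-ratio (3.45): drop in variation per lost degree of freedom,
    relative to the variation per degree of freedom of the full model."""
    numerator = (ss_reduced - ss_full) / (df_reduced - df_full)
    denominator = ss_full / df_full
    return numerator / denominator

# Values quoted in the text for Example 3.14 (n = 30, p = 3 factor levels).
n, p = 30, 3
F = f_statistic(ss_reduced=260.3, ss_full=157.7, df_reduced=n - 1, df_full=n - p)
print(round(F, 2))  # 8.78
```

The degrees of freedom are passed explicitly so that the same function also covers the regression case, where $df(r) = n - 1$ and $df(f) = n - 2$.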


The F-Test in a Linear Regression Model

The t-test of a linear regression model can be put into this framework. For a linear regression model (3.27), the reduced model is the one with $\beta = 0$:

$$y_i = \alpha + 0 \cdot x_i + \varepsilon_i.$$

The reduced model has $n - 1$ degrees of freedom and one parameter, the intercept $\alpha$. The full model is given by $\beta \neq 0$,

$$y_i = \alpha + \beta \cdot x_i + \varepsilon_i,$$

and has $n - 2$ degrees of freedom, since there are two parameters $(\alpha, \beta)$.

The SS(reduced) equals

$$SS(\text{reduced}) = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \text{total variation}.$$

The SS(full) equals

$$SS(\text{full}) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = RSS = \text{unexplained variation}.$$
The F-test is therefore, from (3.45),

$$F = \frac{(\text{total variation} - \text{unexplained variation})/1}{(\text{unexplained variation})/(n - 2)} \tag{3.46}$$

$$\phantom{F} = \frac{\text{explained variation}}{(\text{unexplained variation})/(n - 2)}. \tag{3.47}$$

Using the estimators $\hat{\alpha}$ and $\hat{\beta}$ the explained variation is:

$$\begin{aligned}
\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 &= \sum_{i=1}^{n} (\hat{\alpha} + \hat{\beta} x_i - \bar{y})^2 \\
&= \sum_{i=1}^{n} \{(\bar{y} - \hat{\beta}\bar{x}) + \hat{\beta} x_i - \bar{y}\}^2 \\
&= \sum_{i=1}^{n} \hat{\beta}^2 (x_i - \bar{x})^2 \\
&= \hat{\beta}^2 n s_{XX}.
\end{aligned}$$

From (3.32) the F-ratio (3.46) is therefore:

$$F = \frac{\hat{\beta}^2 n s_{XX}}{RSS/(n - 2)} \tag{3.48}$$

$$\phantom{F} = \left( \frac{\hat{\beta}}{SE(\hat{\beta})} \right)^2. \tag{3.49}$$

The t-test statistic (3.33) is just the square root of the F-statistic (3.49). Note, using (3.39) the F-statistic can be rewritten as

$$F = \frac{r^2/1}{(1 - r^2)/(n - 2)}.$$

In the pullover Example 3.11, we obtain $F = 0.2305$, so that the null hypothesis $\beta = 0$ cannot be rejected. We conclude therefore that there is only a minor influence of prices on sales.
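The equivalence of the three forms of the F-statistic can be checked numerically. The sketch below uses made-up toy data (not the pullover data of Example 3.11) and takes the standard error of the slope as $\sqrt{RSS/(n-2)}/\sqrt{n s_{XX}}$, which is the form assumed here for (3.32).

```python
from math import sqrt

# Hypothetical toy data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)                      # equals n * s_XX
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

beta = sxy / sxx                   # least squares slope
alpha = y_bar - beta * x_bar       # least squares intercept

total = sum((yi - y_bar) ** 2 for yi in y)                          # SS(reduced)
rss = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))    # SS(full)

F = (total - rss) / (rss / (n - 2))          # form (3.46)
se_beta = sqrt(rss / (n - 2)) / sqrt(sxx)    # assumed form of the slope's standard error
t = beta / se_beta
r2 = 1.0 - rss / total
F_from_r2 = r2 / ((1.0 - r2) / (n - 2))      # form based on r^2
```

Up to rounding, `F`, `t ** 2` and `F_from_r2` coincide, which is the content of (3.49) and of the $r^2$ formula above.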


Summary

• Simple ANOVA models an output $Y$ as a function of one factor.

• The reduced model is the hypothesis of equal means.

• The full model is the alternative hypothesis of different means.

• The F-test is based on a comparison of the sum of squares under the full and the reduced models.

• The degrees of freedom are calculated as the number of observations minus the number of parameters.

• The F-statistic is

$$F = \frac{\{SS(\text{reduced}) - SS(\text{full})\}/\{df(r) - df(f)\}}{SS(\text{full})/df(f)}.$$

• The F-test rejects the null hypothesis if the F-statistic is larger than the 95 % quantile of the $F_{df(r) - df(f),\, df(f)}$ distribution.

• The F-test statistic for the slope of the linear regression model $y_i = \alpha + \beta x_i + \varepsilon_i$ is the square of the t-test statistic.


3.6  Multiple  Linear  Model 

The simple linear model and the analysis of variance model can be viewed as particular cases of a more general linear model where the variations of one variable $y$ are explained by $p$ explanatory variables $x$. Let $y$ $(n \times 1)$ and $\mathcal{X}$ $(n \times p)$ be a vector of observations on the response variable and a data matrix on the $p$ explanatory variables. An important application of the developed theory is the least squares fitting. The idea is to approximate $y$ by a linear combination $\hat{y}$ of columns of $\mathcal{X}$, i.e. $\hat{y} \in C(\mathcal{X})$. The problem is to find $\hat{\beta} \in \mathbb{R}^p$ such that $\hat{y} = \mathcal{X}\hat{\beta}$ is the best fit of $y$ in the least-squares sense. The linear model can be written as

$$y = \mathcal{X}\beta + \varepsilon, \tag{3.50}$$

where $\varepsilon$ are the errors. The least squares solution is given by $\hat{\beta}$:

$$\hat{\beta} = \arg\min_{\beta}\, (y - \mathcal{X}\beta)^\top (y - \mathcal{X}\beta) = \arg\min_{\beta}\, \varepsilon^\top \varepsilon. \tag{3.51}$$
Suppose that $(\mathcal{X}^\top \mathcal{X})$ is of full rank and thus invertible. Minimising the expression (3.51) with respect to $\beta$ yields:

$$\hat{\beta} = (\mathcal{X}^\top \mathcal{X})^{-1} \mathcal{X}^\top y. \tag{3.52}$$

The fitted value $\hat{y} = \mathcal{X}\hat{\beta} = \mathcal{X}(\mathcal{X}^\top \mathcal{X})^{-1} \mathcal{X}^\top y = \mathcal{P} y$ is the projection of $y$ onto $C(\mathcal{X})$ as computed in (2.47).

The least squares residuals are

$$e = y - \hat{y} = y - \mathcal{X}\hat{\beta} = \mathcal{Q} y = (\mathcal{I}_n - \mathcal{P}) y.$$

The vector $e$ is the projection of $y$ onto the orthogonal complement of $C(\mathcal{X})$.

Remark 3.5 A linear model with an intercept $\alpha$ can also be written in this framework. The approximating equation is:

$$y_i = \alpha + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i; \quad i = 1, \ldots, n.$$

This can be written as:

$$y = \mathcal{X}^* \beta^* + \varepsilon$$

where $\mathcal{X}^* = (1_n \; \mathcal{X})$ (we add a column of ones to the data). We have by (3.52):

$$\hat{\beta}^* = (\mathcal{X}^{*\top} \mathcal{X}^*)^{-1} \mathcal{X}^{*\top} y.$$

Example 3.15 Let us come back to the "classic blue" pullovers example. In Example 3.11, we considered the regression fit of the sales $X_1$ on the price $X_2$ and concluded that changing the price had only a small influence on sales. A linear model incorporating all three variables allows us to approximate sales as a linear function of price ($X_2$), advertisement ($X_3$) and presence of sales assistants ($X_4$) simultaneously. Adding a column of ones to the data (in order to estimate the intercept $\alpha$) leads to

$$\hat{\alpha} = 65.670 \quad \text{and} \quad \hat{\beta}_1 = -0.216, \; \hat{\beta}_2 = 0.485, \; \hat{\beta}_3 = 0.844.$$

The coefficient of determination is computed as before in (3.40) and is:

$$r^2 = 1 - \frac{e^\top e}{\sum (y_i - \bar{y})^2} = 0.907.$$

We conclude that the variation of $X_1$ is well approximated by the linear relation.
Remark 3.6 The coefficient of determination is influenced by the number of regressors. For a given sample size $n$, the $r^2$ value will increase by adding more regressors into the linear model. The value of $r^2$ may therefore be high even if possibly irrelevant regressors are included. An adjusted coefficient of determination for $p$ regressors and a constant intercept ($p + 1$ parameters) is

$$r^2_{adj} = r^2 - \frac{p(1 - r^2)}{n - (p + 1)}. \tag{3.53}$$

Example 3.16 The corrected coefficient of determination for Example 3.15 is

$$r^2_{adj} = 0.907 - \frac{3(1 - 0.907^2)}{10 - 3 - 1} = 0.818.$$

This means that 81.8 % of the variation of the response variable is explained by the explanatory variables.
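Formula (3.53) is easy to evaluate directly. The sketch below (the function name `adjusted_r2` is hypothetical) also retraces the arithmetic as it is displayed in Example 3.16. The two results differ: the displayed arithmetic squares 0.907 once more, whereas inserting $r^2 = 0.907$ into (3.53) as written gives a larger value. Which of the two was intended is not resolved here.

```python
def adjusted_r2(r2, n, p):
    """Adjusted coefficient of determination (3.53) for p regressors plus an intercept."""
    return r2 - p * (1.0 - r2) / (n - (p + 1))

# n = 10 observations and p = 3 regressors, as in Example 3.15.
direct = adjusted_r2(0.907, n=10, p=3)                     # r2 = 0.907 inserted into (3.53)
as_displayed = 0.907 - 3 * (1 - 0.907 ** 2) / (10 - 3 - 1)  # arithmetic shown in Example 3.16

print(round(direct, 4))        # 0.8605
print(round(as_displayed, 3))  # 0.818
```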

Note that the linear model (3.50) is very flexible and can model non-linear relationships between the response $y$ and the explanatory variables $x$. For example, a quadratic relation in one variable $x$ could be included. Then $y_i = \alpha + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i$ could be written in matrix notation as in (3.50), $y = \mathcal{X}\beta + \varepsilon$ where

$$\mathcal{X} = \begin{pmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{pmatrix}.$$
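As a small illustration of this point, the design matrix with columns $1$, $x_i$, $x_i^2$ can be built and fitted with the same least squares formula. The data below are hypothetical and generated from an exact quadratic, so the recovered coefficients are known in advance.

```python
import numpy as np

# Hypothetical data from an exact quadratic: alpha = 1, beta_1 = 0.5, beta_2 = -2.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 1.0 + 0.5 * x - 2.0 * x ** 2

# The model is non-linear in x but linear in (alpha, beta_1, beta_2).
X = np.column_stack([np.ones_like(x), x, x ** 2])
coef = np.linalg.solve(X.T @ X, X.T @ y)   # least squares solution (3.52)
```

With noise-free data `coef` reproduces the generating coefficients up to rounding; with noisy data it would be their least squares estimate.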


Properties of $\hat{\beta}$

When $y_i$ is the $i$-th observation of a random variable $Y$, the errors are also random. Under standard assumptions (independence, zero mean and constant variance $\sigma^2$), inference can be conducted on $\beta$. Using the properties of Chap. 4, it is easy to prove:

$$E(\hat{\beta}) = \beta$$

$$Var(\hat{\beta}) = \sigma^2 (\mathcal{X}^\top \mathcal{X})^{-1}.$$

The analogue of the t-test for the multivariate linear regression situation is

$$t = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}.$$

The standard error of each coefficient $\hat{\beta}_j$ is given by the square root of the corresponding diagonal element of the matrix $Var(\hat{\beta})$. In standard situations, the variance $\sigma^2$ of the error $\varepsilon$ is not known. For a linear model with intercept, one may estimate it by

$$\hat{\sigma}^2 = \frac{1}{n - (p + 1)} (y - \hat{y})^\top (y - \hat{y}),$$

where $p$ is the dimension of $\beta$. In testing $\beta_j = 0$ we reject the hypothesis at the significance level $\alpha$ if $|t| > t_{1 - \alpha/2;\, n - (p + 1)}$. More general issues on testing linear models are addressed in Chap. 7.
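The standard errors and t-ratios described above can be sketched as follows. The data are simulated with a fixed seed (an assumption made purely for reproducibility), and the explicit inverse is formed only because its diagonal is needed.

```python
import numpy as np

rng = np.random.default_rng(0)      # simulated data, not taken from the book
n, p = 50, 2
X = rng.normal(size=(n, p))
# True model: intercept 1, slope 2 on the first regressor, no effect of the second.
y = 1.0 + 2.0 * X[:, 0] + 0.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

X_star = np.column_stack([np.ones(n), X])       # intercept plus p regressors
XtX_inv = np.linalg.inv(X_star.T @ X_star)
beta_hat = XtX_inv @ X_star.T @ y

resid = y - X_star @ beta_hat
sigma2_hat = resid @ resid / (n - (p + 1))      # estimate of the error variance
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))     # square roots of the diagonal of Var(beta_hat)
t_stats = beta_hat / se                         # one t-ratio per coefficient
```

Each entry of `t_stats` would then be compared with the $t_{1-\alpha/2;\, n-(p+1)}$ quantile, which can be looked up in a table or obtained from a statistics library.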


The ANOVA Model in Matrix Notation

The simple ANOVA problem (Sect. 3.5) may also be rewritten in matrix terms. Recall the definition of a vector of ones from (2.1) and define a vector of zeros as $0_n$. Then construct the following $(n \times p)$ matrix (here $p = 3$),

$$\mathcal{X} = \begin{pmatrix} 1_m & 0_m & 0_m \\ 0_m & 1_m & 0_m \\ 0_m & 0_m & 1_m \end{pmatrix}, \tag{3.54}$$

where $m = 10$. Equation (3.41) then reads as follows.

The parameter vector is $\beta = (\mu_1, \mu_2, \mu_3)^\top$. The data set from Example 3.14 can therefore be written as a linear model $y = \mathcal{X}\beta + \varepsilon$ where $y \in \mathbb{R}^n$ with $n = m \cdot p$ is the stacked vector of the columns of Table 3.1. The projection into the column space $C(\mathcal{X})$ of (3.54) yields the least-squares estimator $\hat{\beta} = (\mathcal{X}^\top \mathcal{X})^{-1} \mathcal{X}^\top y$. Note that $(\mathcal{X}^\top \mathcal{X})^{-1} = (1/10)\mathcal{I}_3$ and that $\mathcal{X}^\top y = (106, 124, 151)^\top$ is the sum $\sum_{k=1}^{m} y_{kj}$ for each factor, i.e. the three column sums of Table 3.1. The least squares estimator is therefore the vector $\hat{\beta}_{H_1} = (\hat{\mu}_1, \hat{\mu}_2, \hat{\mu}_3)^\top = (10.6, 12.4, 15.1)^\top$ of sample means for each factor level $j = 1, 2, 3$. Under the null hypothesis of equal mean values $\mu_1 = \mu_2 = \mu_3 = \mu$, we estimate the parameters under the same constraints. This can be put into the form of a linear constraint:

$$-\mu_1 + \mu_2 = 0$$

$$-\mu_2 + \mu_3 = 0.$$

This can be written as $\mathcal{A}\beta = a$, where

$$\mathcal{A} = \begin{pmatrix} -1 & 1 & 0 \\ 0 & -1 & 1 \end{pmatrix}$$

and

$$a = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.$$

The constrained least-squares solution can be shown (Exercise 3.24) to be given by:

$$\hat{\beta}_{H_0} = \hat{\beta}_{H_1} - (\mathcal{X}^\top \mathcal{X})^{-1} \mathcal{A}^\top \{\mathcal{A} (\mathcal{X}^\top \mathcal{X})^{-1} \mathcal{A}^\top\}^{-1} (\mathcal{A}\hat{\beta}_{H_1} - a). \tag{3.55}$$

It turns out that (3.55) amounts to simply calculating the overall mean $\bar{y} = 12.7$ of the response variable $y$: $\hat{\beta}_{H_0} = (12.7, 12.7, 12.7)^\top$.

The F-test that has already been applied in Example 3.14 can be written as

$$F = \frac{\{\|y - \mathcal{X}\hat{\beta}_{H_0}\|^2 - \|y - \mathcal{X}\hat{\beta}_{H_1}\|^2\}/2}{\|y - \mathcal{X}\hat{\beta}_{H_1}\|^2/27} \tag{3.56}$$

which gives the same significant value 8.78. Note that again we compare the $RSS_{H_0}$ of the reduced model to the $RSS_{H_1}$ of the full model. It corresponds to comparing the lengths of projections into different column spaces. This general approach in testing linear models is described in detail in Chap. 7.
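The matrix formulation can be tried out on a small balanced design. The sketch below uses hypothetical responses with $m = 4$ observations per factor level (Table 3.1 is not reproduced here) and a constraint matrix encoding $\mu_1 = \mu_2 = \mu_3$. The degrees of freedom in the last line are written generally as $q$ and $n - p$, where $q$ is the number of constraints.

```python
import numpy as np

# Hypothetical responses for p = 3 factor levels, m = 4 observations each.
groups = [np.array([9.0, 11.0, 10.0, 12.0]),
          np.array([12.0, 14.0, 13.0, 11.0]),
          np.array([15.0, 17.0, 14.0, 16.0])]
m, p = 4, 3
y = np.concatenate(groups)
n = m * p

# Design matrix in the spirit of (3.54): one indicator column per factor level.
X = np.kron(np.eye(p), np.ones((m, 1)))

XtX_inv = np.linalg.inv(X.T @ X)
beta_h1 = XtX_inv @ X.T @ y                 # unconstrained estimate: the group means

# Linear constraint A beta = a encoding mu_1 = mu_2 = mu_3.
A = np.array([[-1.0, 1.0, 0.0],
              [0.0, -1.0, 1.0]])
a = np.zeros(2)
M = A @ XtX_inv @ A.T
beta_h0 = beta_h1 - XtX_inv @ A.T @ np.linalg.solve(M, A @ beta_h1 - a)   # (3.55)

rss_h0 = np.sum((y - X @ beta_h0) ** 2)     # reduced model
rss_h1 = np.sum((y - X @ beta_h1) ** 2)     # full model
q = A.shape[0]                              # df(r) - df(f)
F = ((rss_h0 - rss_h1) / q) / (rss_h1 / (n - p))
```

In this balanced design every component of `beta_h0` equals the overall mean of `y`, mirroring the remark that (3.55) reduces to the overall mean in Example 3.14.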




Summary

• The relation $y = \mathcal{X}\beta + e$ models a linear relation between a one-dimensional variable $Y$ and a $p$-dimensional variable $X$. $\mathcal{P}y$ gives the best linear regression fit of the vector $y$ onto $C(\mathcal{X})$. The least squares parameter estimator is $\hat{\beta} = (\mathcal{X}^\top \mathcal{X})^{-1} \mathcal{X}^\top y$.

• The simple ANOVA model can be written as a linear model.

• The ANOVA model can be tested by comparing the length of the projection vectors.

• The test statistic of the F-test can be written as

$$\frac{\{\|y - \mathcal{X}\hat{\beta}_{H_0}\|^2 - \|y - \mathcal{X}\hat{\beta}_{H_1}\|^2\}/\{df(r) - df(f)\}}{\|y - \mathcal{X}\hat{\beta}_{H_1}\|^2/df(f)}.$$

• The adjusted coefficient of determination is

$$r^2_{adj} = r^2 - \frac{p(1 - r^2)}{n - (p + 1)}.$$
3.7 Boston Housing

The  main  statistics  presented  so  far  can  be  computed  for  the  data  matrix  X  (506  x  14) 
from  our  Boston  Housing  data  set.  The  sample  means  and  the  sample  medians 
of  each  variable  are  displayed  in  Table  3.3.  The  table  also  provides  the  unbiased 
estimates  of  the  variance  of  each  variable  and  the  corresponding  standard  deviations. 
The  comparison  of  the  means  and  the  medians  confirms  the  asymmetry  of  the 
components  of  X  that  was  pointed  out  in  Sect.  1.9. 

The  (unbiased)  sample  covariance  matrix  is  given  by  the  following  (14  x  14) 
matrix  Sn : 


     73.99    -40.22     23.99     -0.12      0.42     -1.33     85.41     -6.88     46.85    844.82      5.40   -302.38     27.99    -30.72
    -40.22    543.94    -85.41     -0.25     -1.40      5.11   -373.90     32.63    -63.35  -1236.45    -19.78    373.72    -68.78     77.32
     23.99    -85.41     47.06      0.11      0.61     -1.89    124.51    -10.23     35.55    833.36      5.69   -223.58     29.58    -30.52
     -0.12     -0.25      0.11      0.06      0.00      0.02      0.62     -0.05     -0.02     -1.52     -0.07      1.13     -0.10      0.41
      0.42     -1.40      0.61      0.00      0.01     -0.02      2.39     -0.19      0.62     13.05      0.05     -4.02      0.49     -0.46
     -1.33      5.11     -1.89      0.02     -0.02      0.49     -4.75      0.30     -1.28    -34.58     -0.54      8.22     -3.08      4.49
     85.41   -373.90    124.51      0.62      2.39     -4.75    792.36    -44.33    111.77   2402.69     15.94   -702.94    121.08    -97.59
     -6.88     32.63    -10.23     -0.05     -0.19      0.30    -44.33      4.43     -9.07   -189.66     -1.06     56.04     -7.47      4.84
     46.85    -63.35     35.55     -0.02      0.62     -1.28    111.77     -9.07     75.82   1335.76      8.76   -353.28     30.39    -30.56
    844.82  -1236.45    833.36     -1.52     13.05    -34.58   2402.69   -189.66   1335.76  28404.76    168.15  -6797.91    654.71   -726.26
      5.40    -19.78      5.69     -0.07      0.05     -0.54     15.94     -1.06      8.76    168.15      4.69    -35.06      5.78    -10.11
   -302.38    373.72   -223.58      1.13     -4.02      8.22   -702.94     56.04   -353.28  -6797.91    -35.06   8334.75   -238.67    279.99
     27.99    -68.78     29.58     -0.10      0.49     -3.08    121.08     -7.47     30.39    654.71      5.78   -238.67     50.99    -48.45
    -30.72     77.32    -30.52      0.41     -0.46      4.49    -97.59      4.84    -30.56   -726.26    -10.11    279.99    -48.45     84.59


and the corresponding correlation matrix $\mathcal{R}$ $(14 \times 14)$ is:

     1.00  -0.20   0.41  -0.06   0.42  -0.22   0.35  -0.38   0.63   0.58   0.29  -0.39   0.46  -0.39
    -0.20   1.00  -0.53  -0.04  -0.52   0.31  -0.57   0.66  -0.31  -0.31  -0.39   0.18  -0.41   0.36
     0.41  -0.53   1.00   0.06   0.76  -0.39   0.64  -0.71   0.60   0.72   0.38  -0.36   0.60  -0.48
    -0.06  -0.04   0.06   1.00   0.09   0.09   0.09  -0.10  -0.01  -0.04  -0.12   0.05  -0.05   0.18
     0.42  -0.52   0.76   0.09   1.00  -0.30   0.73  -0.77   0.61   0.67   0.19  -0.38   0.59  -0.43
    -0.22   0.31  -0.39   0.09  -0.30   1.00  -0.24   0.21  -0.21  -0.29  -0.36   0.13  -0.61   0.70
     0.35  -0.57   0.64   0.09   0.73  -0.24   1.00  -0.75   0.46   0.51   0.26  -0.27   0.60  -0.38
    -0.38   0.66  -0.71  -0.10  -0.77   0.21  -0.75   1.00  -0.49  -0.53  -0.23   0.29  -0.50   0.25
     0.63  -0.31   0.60  -0.01   0.61  -0.21   0.46  -0.49   1.00   0.91   0.46  -0.44   0.49  -0.38
     0.58  -0.31   0.72  -0.04   0.67  -0.29   0.51  -0.53   0.91   1.00   0.46  -0.44   0.54  -0.47
     0.29  -0.39   0.38  -0.12   0.19  -0.36   0.26  -0.23   0.46   0.46   1.00  -0.18   0.37  -0.51
    -0.39   0.18  -0.36   0.05  -0.38   0.13  -0.27   0.29  -0.44  -0.44  -0.18   1.00  -0.37   0.33
     0.46  -0.41   0.60  -0.05   0.59  -0.61   0.60  -0.50   0.49   0.54   0.37  -0.37   1.00  -0.74
    -0.39   0.36  -0.48   0.18  -0.43   0.70  -0.38   0.25  -0.38  -0.47  -0.51   0.33  -0.74   1.00

Table 3.3 Descriptive statistics for the Boston Housing data set. Q MVAdescbh

    Variable   Mean     Median(X)   Var(X)      Std(X)
    X1           3.61      0.26        73.99      8.60
    X2          11.36      0.00       543.94     23.32
    X3          11.14      9.69        47.06      6.86
    X4           0.07      0.00         0.06      0.25
    X5           0.55      0.54         0.01      0.12
    X6           6.28      6.21         0.49      0.70
    X7          68.57     77.50       792.36     28.15
    X8           3.79      3.21         4.43      2.11
    X9           9.55      5.00        75.82      8.71
    X10        408.24    330.00     28405.00    168.54
    X11         18.46     19.05         4.69      2.16
    X12        356.67    391.44      8334.80     91.29
    X13         12.65     11.36        50.99      7.14
    X14         22.53     21.20        84.59      9.20


Analysing $\mathcal{R}$ confirms most of the comments made from examining the scatterplot matrix in Chap. 1. In particular, the correlation between $X_{14}$ (the value of the house) and all the other variables is given by the last row (or column) of $\mathcal{R}$. The highest correlations (in absolute values) are in decreasing order $X_{13}$, $X_6$, $X_{11}$, etc.

Using Fisher's Z-transform on each of the correlations between $X_{14}$ and the other variables would confirm that all are significantly different from zero, except the correlation between $X_{14}$ and $X_4$ (the indicator variable for the Charles River). We know, however, that the correlation and Fisher's Z-transform are not appropriate for binary variables.

The same descriptive statistics can be calculated for the transformed variables (transformations were motivated in Sect. 1.9). The results are given in Table 3.4 and as can be seen, most of the variables are now more symmetric.

Table 3.4 Descriptive statistics for the Boston Housing data set after the transformation. Q MVAdescbh

    Variable   Mean     Median(X)   Var(X)   Std(X)
    X1         -0.78     -1.36        4.67     2.16
    X2          1.14      0.00        5.44     2.33
    X3          2.16      2.27        0.60     0.78
    X4          0.07      0.00        0.06     0.25
    X5         -0.61     -0.62        0.04     0.20
    X6          1.83      1.83        0.01     0.11
    X7          5.06      5.29       12.72     3.57
    X8          1.19      1.17        0.29     0.54
    X9          1.87      1.61        0.77     0.87
    X10         5.93      5.80        0.16     0.40
    X11         2.15      2.04        1.86     1.36
    X12         3.57      3.91        0.83     0.91
    X13         3.42      3.37        0.97     0.99
    X14         3.03      3.05        0.17     0.41

Note that the covariances and the correlations are sensitive to these non-linear transformations. For example, the correlation matrix is now

     1.00  -0.52   0.74   0.03   0.81  -0.32   0.70  -0.74   0.84   0.81   0.45  -0.48   0.62  -0.57
    -0.52   1.00  -0.66  -0.04  -0.57   0.31  -0.53   0.59  -0.35  -0.31  -0.35   0.18  -0.45   0.36
     0.74  -0.66   1.00   0.08   0.75  -0.43   0.66  -0.73   0.58   0.66   0.46  -0.33   0.62  -0.55
     0.03  -0.04   0.08   1.00   0.08   0.08   0.07  -0.09   0.01  -0.04  -0.13   0.05  -0.06   0.16
     0.81  -0.57   0.75   0.08   1.00  -0.32   0.78  -0.86   0.61   0.67   0.34  -0.38   0.61  -0.52
    -0.32   0.31  -0.43   0.08  -0.32   1.00  -0.28   0.28  -0.21  -0.31  -0.32   0.13  -0.64   0.61
     0.70  -0.53   0.66   0.07   0.78  -0.28   1.00  -0.80   0.47   0.54   0.38  -0.29   0.64  -0.48
    -0.74   0.59  -0.73  -0.09  -0.86   0.28  -0.80   1.00  -0.54  -0.60  -0.32   0.32  -0.56   0.41
     0.84  -0.35   0.58   0.01   0.61  -0.21   0.47  -0.54   1.00   0.82   0.40  -0.41   0.46  -0.43
     0.81  -0.31   0.66  -0.04   0.67  -0.31   0.54  -0.60   0.82   1.00   0.48  -0.43   0.53  -0.56
     0.45  -0.35   0.46  -0.13   0.34  -0.32   0.38  -0.32   0.40   0.48   1.00  -0.20   0.43  -0.51
    -0.48   0.18  -0.33   0.05  -0.38   0.13  -0.29   0.32  -0.41  -0.43  -0.20   1.00  -0.36   0.40
     0.62  -0.45   0.62  -0.06   0.61  -0.64   0.64  -0.56   0.46   0.53   0.43  -0.36   1.00  -0.83
    -0.57   0.36  -0.55   0.16  -0.52   0.61  -0.48   0.41  -0.43  -0.56  -0.51   0.40  -0.83   1.00

Notice that some of the correlations between $X_{14}$ and the other variables have increased.

If we want to explain the variations of the price $X_{14}$ by the variation of all the other variables $X_1, \ldots, X_{13}$ we could estimate the linear model

$$X_{14} = \beta_0 + \sum_{j=1}^{13} \beta_j X_j + \varepsilon. \tag{3.57}$$

The result is given in Table 3.5.


Table 3.5 Linear regression results for all variables of Boston Housing data set. Q MVAlinregbh

    Variable    beta_j      SE(beta_j)   t          p-value
    Constant     4.1769     0.3790        11.020    0.0000
    X1          -0.0146     0.0117        -1.254    0.2105
    X2           0.0014     0.0056         0.247    0.8051
    X3          -0.0127     0.0223        -0.570    0.5692
    X4           0.1100     0.0366         3.002    0.0028
    X5          -0.2831     0.1053        -2.688    0.0074
    X6           0.4211     0.1102         3.822    0.0001
    X7           0.0064     0.0049         1.317    0.1885
    X8          -0.1832     0.0368        -4.977    0.0000
    X9           0.0684     0.0225         3.042    0.0025
    X10         -0.2018     0.0484        -4.167    0.0000
    X11         -0.0400     0.0081        -4.946    0.0000
    X12          0.0445     0.0115         3.882    0.0001
    X13         -0.2626     0.0161       -16.320    0.0000

The values of $r^2$ (0.765) and $r^2_{adj}$ (0.759) show that most of the variance of $X_{14}$ is explained by the linear model (3.57).

Again we see that the variations of $X_{14}$ are mostly explained by (in decreasing order of the absolute value of the t-statistic) $X_{13}$, $X_8$, $X_{11}$, $X_{10}$, $X_{12}$, $X_6$, $X_9$, $X_4$ and $X_5$. The other variables $X_1$, $X_2$, $X_3$ and $X_7$ seem to have little influence on the variations of $X_{14}$. This will be confirmed by the testing procedures that will be developed in Chap. 7.


3.8  Exercises 

Exercise 3.1 The covariance $s_{X_4 X_5}$ between $X_4$ and $X_5$ for the entire bank data set is positive. Given the definitions of $X_4$ and $X_5$, we would expect a negative covariance. Using Fig. 3.1 can you explain why $s_{X_4 X_5}$ is positive?

Exercise 3.2 Consider the two sub-clouds of counterfeit and genuine bank notes in Fig. 3.1 separately. Do you still expect $s_{X_4 X_5}$ (now calculated separately for each cloud) to be positive?

Exercise 3.3 We remarked that for two normal random variables, zero covariance implies independence. Why does this remark not apply to Example 3.4?

Exercise 3.4 Compute the covariance between the variables

$X_2$ = miles per gallon,

$X_8$ = weight

from the car data set (Table 22.3). What sign do you expect the covariance to have?

Exercise 3.5 Compute the correlation matrix of the variables in Example 3.2. Comment on the sign of the correlations and test the hypothesis

$$\rho_{X_1 X_2} = 0.$$

Exercise 3.6 Suppose you have observed a set of observations $\{x_i\}_{i=1}^{n}$ with $\bar{x} = 0$, $s_{XX} = 1$ and $n^{-1} \sum_{i=1}^{n} (x_i - \bar{x})^3 = 0$. Define the variable $y_i = x_i^2$. Can you immediately tell whether $r_{XY} \neq 0$?

Exercise 3.7 Find formulas (3.29) and (3.30) for $\hat{\alpha}$ and $\hat{\beta}$ by differentiating the objective function in (3.28) w.r.t. $\alpha$ and $\beta$.

Exercise 3.8 How many sales does the textile manager expect with a "classic blue" pullover price of $x = 105$?

Exercise 3.9 What does a scatterplot of two random variables look like for $r^2 = 1$ and $r^2 = 0$?
Exercise 3.10 Prove the variance decomposition (3.38) and show that the coefficient of determination is the square of the simple correlation between $X$ and $Y$.

Exercise 3.11 Make a boxplot for the residuals $\hat{\varepsilon}_i = y_i - \hat{\alpha} - \hat{\beta} x_i$ for the "classic blue" pullovers data. If there are outliers, identify them and run the linear regression again without them. Do you obtain a stronger influence of price on sales?

Exercise 3.12 Under what circumstances would you obtain the same coefficients from the linear regression lines of $Y$ on $X$ and of $X$ on $Y$?

Exercise 3.13 Treat the design of Example 3.14 as if there were thirty shops and not ten. Define $x_i$ as the index of the shop, i.e. $x_i = i$, $i = 1, 2, \ldots, 30$. The null hypothesis is a constant regression line, $EY = \mu$. What does the alternative regression curve look like?

Exercise 3.14 Perform the test in Exercise 3.13 for the shop example with a 0.99 significance level. Do you still reject the hypothesis of equal marketing strategies?

Exercise 3.15 Compute an approximate confidence interval for $\rho_{X_1 X_4}$ in Example 3.2. Hint: start from a confidence interval for $\tanh^{-1}(\rho_{X_1 X_4})$ and then apply the inverse transformation.

Exercise 3.16 In Example 3.2, using the exchange rate of 1 EUR = 106 JPY, compute the same empirical covariance using prices in Japanese Yen rather than in Euros. Is there a significant difference? Why?

Exercise 3.17 Why does the correlation have the same sign as the covariance?

Exercise 3.18 Show that $\mathrm{rank}(\mathcal{H}) = \mathrm{tr}(\mathcal{H}) = n - 1$.

Exercise 3.19 Show that $\mathcal{X}_* = \mathcal{H}\mathcal{X}\mathcal{D}^{-1/2}$ is the standardised data matrix, i.e. $\bar{x}_* = 0$ and $\mathcal{S}_{\mathcal{X}_*} = \mathcal{R}_{\mathcal{X}}$.

Exercise 3.20 Compute for the pullovers data the regression of $X_1$ on $X_2$, $X_3$ and of $X_1$ on $X_2$, $X_4$. Which one has the better coefficient of determination?

Exercise 3.21 Compare for the pullovers data the coefficient of determination for the regression of $X_1$ on $X_2$ (Example 3.11), of $X_1$ on $X_2$, $X_3$ (Exercise 3.20) and of $X_1$ on $X_2$, $X_3$, $X_4$ (Example 3.15). Observe that this coefficient is increasing with the number of predictor variables. Is this always the case?


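Exercise 3.22 Consider the ANOVA problem (Sect. 3.5) again. Establish the constraint matrix $\mathcal{A}$ for testing $\mu_1 = \mu_2$. Test this hypothesis via an analog of (3.55) and (3.56).

Exercise 3.23 Prove (3.52). (Hint, let $f(\beta) = (y - \mathcal{X}\beta)^\top (y - \mathcal{X}\beta)$ and solve $\frac{\partial f(\beta)}{\partial \beta} = 0$.)

Exercise 3.24 Consider the linear model $Y = \mathcal{X}\beta + \varepsilon$ where $\hat{\beta} = \arg\min_{\beta} \varepsilon^\top \varepsilon$ is subject to the linear constraints $\mathcal{A}\hat{\beta} = a$ where $\mathcal{A}$ $(q \times p)$, $(q \le p)$ is of rank $q$ and $a$ is of dimension $(q \times 1)$. Show that

$$\hat{\beta} = \hat{\beta}_{OLS} - (\mathcal{X}^\top \mathcal{X})^{-1} \mathcal{A}^\top \{\mathcal{A} (\mathcal{X}^\top \mathcal{X})^{-1} \mathcal{A}^\top\}^{-1} (\mathcal{A}\hat{\beta}_{OLS} - a)$$

where $\hat{\beta}_{OLS} = (\mathcal{X}^\top \mathcal{X})^{-1} \mathcal{X}^\top y$. (Hint, let $f(\beta, \lambda) = (y - \mathcal{X}\beta)^\top (y - \mathcal{X}\beta) - \lambda^\top (\mathcal{A}\beta - a)$ where $\lambda \in \mathbb{R}^q$ and solve $\frac{\partial f(\beta, \lambda)}{\partial \beta} = 0$ and $\frac{\partial f(\beta, \lambda)}{\partial \lambda} = 0$.)

Exercise 3.25 Compute the covariance matrix $\mathcal{S} = Cov(\mathcal{X})$ where $\mathcal{X}$ denotes the matrix of observations on the counterfeit bank notes. Make a Jordan decomposition of $\mathcal{S}$. Why are all of the eigenvalues positive?

Exercise 3.26 Compute the covariance of the counterfeit notes after they are linearly transformed by the vector $a = (1, 1, 1, 1, 1, 1)^\top$.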


Chapter  4 

Multivariate  Distributions 


The  preceding  chapter  showed  that  by  using  the  two  first  moments  of  a  multivariate 
distribution  (the  mean  and  the  covariance  matrix),  a  lot  of  information  on  the 
relationship  between  the  variables  can  be  made  available.  Only  basic  statistical 
theory  was  used  to  derive  tests  of  independence  or  of  linear  relationships.  In  this 
chapter  we  give  an  introduction  to  the  basic  probability  tools  useful  in  statistical 
multivariate  analysis. 

Means  and  covariances  share  many  interesting  and  useful  properties,  but  they 
represent  only  part  of  the  information  on  a  multivariate  distribution.  Section  4.1 
presents  the  basic  probability  tools  used  to  describe  a  multivariate  random  variable, 
including  marginal  and  conditional  distributions  and  the  concept  of  independence. 
In  Sect.  4.2,  basic  properties  on  means  and  covariances  (marginal  and  conditional 
ones)  are  derived. 

Since  many  statistical  procedures  rely  on  transformations  of  a  multivariate 
random  variable,  Sect.  4.3  proposes  the  basic  techniques  needed  to  derive  the 
distribution  of  transformations  with  a  special  emphasis  on  linear  transforms.  As 
an  important  example  of  a  multivariate  random  variable,  Sect.  4.4  defines  the 
multinormal  distribution.  It  will  be  analysed  in  more  detail  in  Chap.  5  along 
with  most  of  its  “companion”  distributions  that  are  useful  in  making  multivariate 
statistical  inferences. 

The  normal  distribution  plays  a  central  role  in  statistics  because  it  can  be  viewed 
as  an  approximation  and  limit  of  many  other  distributions.  The  basic  justification 
relies  on  the  central  limit  theorem  presented  in  Sect.  4.5.  We  present  this  central 
theorem  in  the  framework  of  sampling  theory.  A  useful  extension  of  this  theorem  is 
also  given:  it  is  an  approximate  distribution  to  transformations  of  asymptotically 
normal  variables.  The  increasing  power  of  computers  today  makes  it  possible 
to  consider  alternative  approximate  sampling  distributions.  These  are  based  on 
resampling  techniques  and  are  suitable  for  many  general  situations.  Section  4.8 
gives  an  introduction  to  the  ideas  behind  bootstrap  approximations. 


4.1  Distribution  and  Density  Function 


Let $X = (X_1, X_2, \ldots, X_p)^\top$ be a random vector. The cumulative distribution function (cdf) of $X$ is defined by

$$F(x) = P(X \le x) = P(X_1 \le x_1, X_2 \le x_2, \ldots, X_p \le x_p).$$

For continuous $X$, a nonnegative probability density function (pdf) $f$ exists such that

$$F(x) = \int_{-\infty}^{x} f(u)\, du. \tag{4.1}$$

Note that

$$\int_{-\infty}^{\infty} f(u)\, du = 1.$$

Most of the integrals appearing below are multidimensional. For instance, $\int_{-\infty}^{x} f(u)\, du$ means $\int_{-\infty}^{x_p} \cdots \int_{-\infty}^{x_1} f(u_1, \ldots, u_p)\, du_1 \cdots du_p$. Note also that the cdf $F$ is differentiable with

$$f(x) = \frac{\partial^p F(x)}{\partial x_1 \cdots \partial x_p}.$$

For discrete $X$, the values of this random variable are concentrated on a countable or finite set of points $\{c_j\}_{j \in J}$; the probability of events of the form $\{X \in D\}$ can then be computed as

$$P(X \in D) = \sum_{\{j:\, c_j \in D\}} P(X = c_j).$$

If we partition $X$ as $X = (X_1, X_2)^\top$ with $X_1 \in \mathbb{R}^k$ and $X_2 \in \mathbb{R}^{p-k}$, then the function

$$F_{X_1}(x_1) = P(X_1 \le x_1) = F(x_{11}, \ldots, x_{1k}, \infty, \ldots, \infty) \tag{4.2}$$

is called the marginal cdf. $F = F(x)$ is called the joint cdf. For continuous $X$ the marginal pdf can be computed from the joint density by "integrating out" the variable not of interest:

$$f_{X_1}(x_1) = \int_{-\infty}^{\infty} f(x_1, x_2)\, dx_2. \tag{4.3}$$

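The conditional pdf of $X_2$ given $X_1 = x_1$ is given as

$$f(x_2 \mid x_1) = \frac{f(x_1, x_2)}{f_{X_1}(x_1)}. \tag{4.4}$$

Example 4.1 Consider the pdf

$$f(x_1, x_2) = \begin{cases} \frac{1}{2} x_1 + \frac{3}{2} x_2 & 0 \le x_1, x_2 \le 1, \\ 0 & \text{otherwise.} \end{cases}$$

$f(x_1, x_2)$ is a density since

$$\int\!\!\int f(x_1, x_2)\, dx_1 dx_2 = \frac{1}{2} \left[ \frac{x_1^2}{2} \right]_0^1 + \frac{3}{2} \left[ \frac{x_2^2}{2} \right]_0^1 = \frac{1}{4} + \frac{3}{4} = 1.$$

The marginal densities are

$$f_{X_1}(x_1) = \int f(x_1, x_2)\, dx_2 = \int_0^1 \left( \frac{1}{2} x_1 + \frac{3}{2} x_2 \right) dx_2 = \frac{1}{2} x_1 + \frac{3}{4};$$

$$f_{X_2}(x_2) = \int f(x_1, x_2)\, dx_1 = \int_0^1 \left( \frac{1}{2} x_1 + \frac{3}{2} x_2 \right) dx_1 = \frac{3}{2} x_2 + \frac{1}{4}.$$

The conditional densities are therefore

$$f(x_2 \mid x_1) = \frac{\frac{1}{2} x_1 + \frac{3}{2} x_2}{\frac{1}{2} x_1 + \frac{3}{4}}$$

and

$$f(x_1 \mid x_2) = \frac{\frac{1}{2} x_1 + \frac{3}{2} x_2}{\frac{3}{2} x_2 + \frac{1}{4}}.$$

Note that these conditional pdf's are nonlinear in $x_1$ and $x_2$ although the joint pdf has a simple (linear) structure.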
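The calculations of Example 4.1 can be double-checked numerically. The sketch below uses a plain midpoint rule (the helper `integrate` is written here for illustration and is not a library function); since all integrands are linear in the integration variable, the rule is exact up to rounding.

```python
def f(x1, x2):
    """Joint pdf of Example 4.1 on the unit square, zero elsewhere."""
    if 0.0 <= x1 <= 1.0 and 0.0 <= x2 <= 1.0:
        return 0.5 * x1 + 1.5 * x2
    return 0.0

def integrate(g, lo=0.0, hi=1.0, steps=1000):
    """Midpoint rule on [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(g(lo + (i + 0.5) * h) for i in range(steps))

# Total probability mass: integrate x2 out first, then x1.
total_mass = integrate(lambda x1: integrate(lambda x2: f(x1, x2), steps=200), steps=200)

def marginal_x1(x1):
    """Marginal pdf of X1, obtained by integrating x2 out as in (4.3)."""
    return integrate(lambda x2: f(x1, x2))

def conditional_x2_given_x1(x2, x1):
    """Conditional pdf (4.4)."""
    return f(x1, x2) / marginal_x1(x1)
```

For instance, `marginal_x1(0.4)` agrees with $\frac{1}{2} \cdot 0.4 + \frac{3}{4} = 0.95$, and the conditional density integrates to one over $x_2$ for any fixed $x_1$ in the unit interval.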

Independence  of  two  random  variables  is  defined  as  follows. 

Definition 4.1 $X_1$ and $X_2$ are independent iff $f(x) = f(x_1, x_2) = f_{X_1}(x_1) f_{X_2}(x_2)$.

That is, $X_1$ and $X_2$ are independent if the conditional pdf's are equal to the marginal densities, i.e. $f(x_1 \mid x_2) = f_{X_1}(x_1)$ and $f(x_2 \mid x_1) = f_{X_2}(x_2)$. Independence can be interpreted as follows: knowing $X_2 = x_2$ does not change the probability assessments on $X_1$, and conversely.
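Definition 4.1 suggests a simple check: compare the joint density with the product of the marginals on a grid. As a sketch, the density of Example 4.1 and its marginals are used below; the grid and the tolerance are arbitrary choices.

```python
def joint(x1, x2):
    return 0.5 * x1 + 1.5 * x2     # joint pdf of Example 4.1 on 0 <= x1, x2 <= 1

def marg1(x1):
    return 0.5 * x1 + 0.75         # marginal pdf of X1 from Example 4.1

def marg2(x2):
    return 1.5 * x2 + 0.25         # marginal pdf of X2 from Example 4.1

grid = [i / 10 for i in range(11)]
# Largest deviation between the joint pdf and the product of the marginals.
max_gap = max(abs(joint(u, v) - marg1(u) * marg2(v)) for u in grid for v in grid)
independent = max_gap < 1e-12
```

Here `independent` is `False`: already at the origin the joint density is 0 while the product of the marginals is $0.75 \cdot 0.25$, so $X_1$ and $X_2$ of Example 4.1 are not independent.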


Different  joint  pdf’s  may  have  the  same  marginal  pdf’s. 
Example 4.2 Consider the pdf's

$$f(x_1, x_2) = 1, \quad 0 < x_1, x_2 < 1,$$

and

$$f(x_1, x_2) = 1 + \alpha (2x_1 - 1)(2x_2 - 1), \quad 0 < x_1, x_2 < 1, \quad -1 \le \alpha \le 1.$$

We compute in both cases the marginal pdf's as

$$f_{X_1}(x_1) = 1, \quad f_{X_2}(x_2) = 1.$$

Indeed

$$\int_0^1 \{1 + \alpha (2x_1 - 1)(2x_2 - 1)\}\, dx_2 = 1 + \alpha (2x_1 - 1) [x_2^2 - x_2]_0^1 = 1.$$

Hence we obtain identical marginals from different joint distributions.
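The claim of Example 4.2 can be confirmed numerically for a few values of $\alpha$. The sketch below integrates $x_2$ out with a midpoint rule, which is exact here up to rounding because the integrand is linear in $x_2$; the chosen values of $\alpha$ and $x_1$ are arbitrary.

```python
def joint(x1, x2, alpha):
    """Second pdf of Example 4.2 on the unit square; alpha is assumed to lie in [-1, 1]."""
    return 1.0 + alpha * (2.0 * x1 - 1.0) * (2.0 * x2 - 1.0)

def marginal_x1(x1, alpha, steps=1000):
    """Integrate x2 out over (0, 1) with the midpoint rule."""
    h = 1.0 / steps
    return h * sum(joint(x1, (i + 0.5) * h, alpha) for i in range(steps))

# Marginal density of X1 at several points, for several values of alpha.
values = [marginal_x1(x1, alpha)
          for alpha in (-1.0, 0.0, 0.7)
          for x1 in (0.1, 0.5, 0.9)]
```

Every entry of `values` equals 1 up to rounding, whatever the value of $\alpha$, so the marginals carry no information about the dependence parameter.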

Let us study the concept of independence using the bank notes example. Consider the variables $X_4$ (lower inner frame) and $X_5$ (upper inner frame). From Chap. 3, we already know that they have significant correlation, so they are almost surely not independent. Kernel estimates of the marginal densities, $\hat{f}_{X_4}$ and $\hat{f}_{X_5}$, are given in Fig. 4.1. In Fig. 4.2 (left) we show the product of these two densities. The kernel density technique was presented in Sect. 1.3. If $X_4$ and $X_5$ are independent, this product $\hat{f}_{X_4} \cdot \hat{f}_{X_5}$ should be roughly equal to $\hat{f}(x_4, x_5)$, the estimate of the joint density of $(X_4, X_5)$. Comparing the two graphs in Fig. 4.2 reveals that the two densities are different. The two variables $X_4$ and $X_5$ are therefore not independent.

An  elegant  concept  of  connecting  marginals  with  joint  cdfs  is  given  by  copulae. 
Copulae  are  important  in  Value-at-Risk  calculations  and  are  an  essential  tool  in 
quantitative  finance  (Hardle,  Hautsch,  &  Overbeck,  2009). 

For simplicity of presentation we concentrate on the $p = 2$ dimensional case. A two-dimensional copula is a function $C : [0, 1]^2 \to [0, 1]$ with the following properties:

•  For every $u \in [0, 1]$: $C(0, u) = C(u, 0) = 0$.

•  For every $u \in [0, 1]$: $C(u, 1) = u$ and $C(1, u) = u$.

•  For every $(u_1, u_2), (v_1, v_2) \in [0, 1] \times [0, 1]$ with $u_1 \le v_1$ and $u_2 \le v_2$:

$$C(v_1, v_2) - C(v_1, u_2) - C(u_1, v_2) + C(u_1, u_2) \ge 0.$$


The usage of the name "copula" for the function $C$ is explained by the following theorem.

Fig. 4.1 Univariate estimates of the density of $X_4$ (left) and $X_5$ (right) of the bank notes. Q MVAdenbank2

Fig. 4.2 The product of univariate density estimates (left) and the joint density estimate (right) for $X_4$ and $X_5$ of the bank notes. Q MVAdenbank3

Theorem 4.1 (Sklar's Theorem) Let $F$ be a joint distribution function with marginal distribution functions $F_{X_1}$ and $F_{X_2}$. Then a copula $C$ exists with

$$F(x_1, x_2) = C\{F_{X_1}(x_1), F_{X_2}(x_2)\} \tag{4.5}$$

for every $x_1, x_2 \in \mathbb{R}$. If $F_{X_1}$ and $F_{X_2}$ are continuous, then $C$ is unique. On the other hand, if $C$ is a copula and $F_{X_1}$ and $F_{X_2}$ are distribution functions, then the function $F$ defined by (4.5) is a joint distribution function with marginals $F_{X_1}$ and $F_{X_2}$.

With Sklar's Theorem, the use of the name "copula" becomes obvious. It was chosen to describe "a function that links a multidimensional distribution to its one-dimensional margins" and appeared in the mathematical literature for the first time in Sklar (1959).

Example 4.3 The structure of independence implies that the product of the distribution functions $F_{X_1}$ and $F_{X_2}$ equals their joint distribution function $F$,

$$F(x_1, x_2) = F_{X_1}(x_1) \cdot F_{X_2}(x_2). \tag{4.6}$$

Thus, we obtain the independence copula $C = \Pi$ from

$$\Pi(u_1, \ldots, u_n) = \prod_{i=1}^{n} u_i.$$

Theorem 4.2 Let $X_1$ and $X_2$ be random variables with continuous distribution functions $F_{X_1}$ and $F_{X_2}$ and the joint distribution function $F$. Then $X_1$ and $X_2$ are independent if and only if $C_{X_1, X_2} = \Pi$.

Proof From Sklar's Theorem we know that there exists a unique copula $C$ with

$$P(X_1 \le x_1, X_2 \le x_2) = F(x_1, x_2) = C\{F_{X_1}(x_1), F_{X_2}(x_2)\}. \tag{4.7}$$

Independence can be seen using (4.5) for the joint distribution function $F$ and the definition of $\Pi$,

$$F(x_1, x_2) = C\{F_{X_1}(x_1), F_{X_2}(x_2)\} = F_{X_1}(x_1)\, F_{X_2}(x_2). \tag{4.8}$$

□

Example 4.4 The Gumbel-Hougaard family of copulae (Nelsen, 1999) is given by the function

$$C_\theta(u, v) = \exp\left[-\left\{(-\log u)^\theta + (-\log v)^\theta\right\}^{1/\theta}\right]. \tag{4.9}$$

The parameter $\theta$ may take all values in the interval $[1, \infty)$. The Gumbel-Hougaard copulae are suited to describe bivariate extreme value distributions.

For $\theta = 1$, the expression (4.9) reduces to the product copula, i.e. $C_1(u, v) = \Pi(u, v) = uv$. For $\theta \to \infty$ one finds for the Gumbel-Hougaard copula:

$$C_\theta(u, v) \to \min(u, v) = M(u, v),$$

where the function $M$ is also a copula such that $C(u, v) \le M(u, v)$ for an arbitrary copula $C$. The copula $M$ is called the Fréchet-Hoeffding upper bound.

Similarly, we obtain the Fréchet-Hoeffding lower bound $W(u, v) = \max(u + v - 1, 0)$, which satisfies $W(u, v) \le C(u, v)$ for any other copula $C$.
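The limiting cases of Example 4.4 are easy to inspect numerically. The sketch below is only an illustration under the assumption that $u, v \in (0, 1]$; the helper names are hypothetical and chosen for this example.

```python
import math


def gumbel_hougaard(u, v, theta):
    """Gumbel-Hougaard copula (4.9) for u, v in (0, 1] and theta >= 1."""
    s = (-math.log(u)) ** theta + (-math.log(v)) ** theta
    return math.exp(-s ** (1.0 / theta))


def frechet_upper(u, v):
    """Frechet-Hoeffding upper bound M(u, v)."""
    return min(u, v)


def frechet_lower(u, v):
    """Frechet-Hoeffding lower bound W(u, v)."""
    return max(u + v - 1.0, 0.0)


u, v = 0.3, 0.7
print(gumbel_hougaard(u, v, 1.0), u * v)                    # theta = 1: product copula
print(gumbel_hougaard(u, v, 50.0), frechet_upper(u, v))     # large theta: close to M(u, v)
print(frechet_lower(u, v), gumbel_hougaard(u, v, 2.0), frechet_upper(u, v))
```

The last line prints a value of the copula between its lower and upper bound, as the inequalities $W \le C \le M$ require.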


Summary

•  The cumulative distribution function (cdf) is defined as $F(x) = P(X \le x)$.

•  If a probability density function (pdf) $f$ exists then $F(x) = \int_{-\infty}^{x} f(u)\,du$.

•  The pdf integrates to one, i.e. $\int_{-\infty}^{\infty} f(x)\,dx = 1$.

•  Let $X = (X_1, X_2)^\top$ be partitioned into sub-vectors $X_1$ and $X_2$ with joint cdf $F$. Then $F_{X_1}(x_1) = P(X_1 \le x_1)$ is the marginal cdf of $X_1$. The marginal pdf of $X_1$ is obtained by $f_{X_1}(x_1) = \int_{-\infty}^{\infty} f(x_1, x_2)\,dx_2$. Different joint pdf's may have the same marginal pdf's.

•  The conditional pdf of $X_2$ given $X_1 = x_1$ is defined as $f(x_2 \mid x_1) = \dfrac{f(x_1, x_2)}{f_{X_1}(x_1)}$.

•  Two random variables $X_1$ and $X_2$ are called independent iff $f(x_1, x_2) = f_{X_1}(x_1)\, f_{X_2}(x_2)$. This is equivalent to $f(x_2 \mid x_1) = f_{X_2}(x_2)$.

•  A copula is a function which connects marginals to form joint cdfs.


4.2  Moments  and  Characteristic  Functions 
Moments:  Expectation  and  Covariance  Matrix 


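If $X$ is a random vector with density $f(x)$ then the expectation of $X$ is

$$\mathrm{E}X = \begin{pmatrix} \mathrm{E}X_1 \\ \vdots \\ \mathrm{E}X_p \end{pmatrix} = \int x f(x)\,dx = \begin{pmatrix} \int x_1 f(x)\,dx \\ \vdots \\ \int x_p f(x)\,dx \end{pmatrix} = \mu. \tag{4.10}$$

Accordingly, the expectation of a matrix of random elements has to be understood component by component. The operation of forming expectations is linear:

$$\mathrm{E}(\alpha X + \beta Y) = \alpha\,\mathrm{E}X + \beta\,\mathrm{E}Y. \tag{4.11}$$

If $\mathcal{A}(q \times p)$ is a matrix of real numbers, we have:

$$\mathrm{E}(\mathcal{A}X) = \mathcal{A}\,\mathrm{E}X. \tag{4.12}$$

When $X$ and $Y$ are independent,

$$\mathrm{E}(XY^\top) = \mathrm{E}X\,\mathrm{E}Y^\top. \tag{4.13}$$

The matrix

$$\mathrm{Var}(X) = \Sigma = \mathrm{E}(X - \mu)(X - \mu)^\top \tag{4.14}$$

is the (theoretical) covariance matrix. We write for a vector $X$ with mean vector $\mu$ and covariance matrix $\Sigma$,

$$X \sim (\mu, \Sigma). \tag{4.15}$$

The $(p \times q)$ matrix

$$\Sigma_{XY} = \mathrm{Cov}(X, Y) = \mathrm{E}(X - \mu)(Y - \nu)^\top \tag{4.16}$$

is the covariance matrix of $X \sim (\mu, \Sigma_{XX})$ and $Y \sim (\nu, \Sigma_{YY})$. Note that $\Sigma_{XY} = \Sigma_{YX}^\top$ and that $Z = \binom{X}{Y}$ has covariance $\Sigma_{ZZ} = \begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{pmatrix}$. From

$$\mathrm{Cov}(X, Y) = \mathrm{E}(XY^\top) - \mu\nu^\top = \mathrm{E}(XY^\top) - \mathrm{E}X\,\mathrm{E}Y^\top \tag{4.17}$$

it follows that $\mathrm{Cov}(X, Y) = 0$ in the case where $X$ and $Y$ are independent. We often say that $\mu = \mathrm{E}(X)$ is the first order moment of $X$ and that $\mathrm{E}(XX^\top)$ provides the second order moments of $X$:

$$\mathrm{E}(XX^\top) = \{\mathrm{E}(X_i X_j)\}, \quad \text{for } i = 1, \ldots, p \text{ and } j = 1, \ldots, p. \tag{4.18}$$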

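Properties of the Covariance Matrix $\Sigma = \mathrm{Var}(X)$

$$\Sigma = (\sigma_{X_i X_j}), \quad \sigma_{X_i X_j} = \mathrm{Cov}(X_i, X_j), \quad \sigma_{X_i X_i} = \mathrm{Var}(X_i) \tag{4.19}$$

$$\Sigma = \mathrm{E}(XX^\top) - \mu\mu^\top \tag{4.20}$$

$$\Sigma \ge 0 \tag{4.21}$$

Properties of Variances and Covariances

$$\mathrm{Var}(a^\top X) = a^\top \mathrm{Var}(X)\,a = \sum_{i,j} a_i a_j \sigma_{X_i X_j} \tag{4.22}$$

$$\mathrm{Var}(\mathcal{A}X + b) = \mathcal{A}\,\mathrm{Var}(X)\,\mathcal{A}^\top \tag{4.23}$$

$$\mathrm{Cov}(X + Y, Z) = \mathrm{Cov}(X, Z) + \mathrm{Cov}(Y, Z) \tag{4.24}$$

$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Cov}(X, Y) + \mathrm{Cov}(Y, X) + \mathrm{Var}(Y) \tag{4.25}$$

$$\mathrm{Cov}(\mathcal{A}X, \mathcal{B}Y) = \mathcal{A}\,\mathrm{Cov}(X, Y)\,\mathcal{B}^\top. \tag{4.26}$$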


Let  us  compute  these  quantities  for  a  specific  joint  density. 

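Example 4.5 Consider the pdf of Example 4.1. The mean vector $\mu = \binom{\mu_1}{\mu_2}$ is

$$\mu_1 = \iint x_1 f(x_1, x_2)\,dx_1 dx_2 = \int_0^1\!\!\int_0^1 x_1\left(\tfrac{1}{2}x_1 + \tfrac{3}{2}x_2\right)dx_1 dx_2 = \int_0^1 x_1\left(\tfrac{1}{2}x_1 + \tfrac{3}{4}\right)dx_1 = \tfrac{1}{2}\left[\tfrac{x_1^3}{3}\right]_0^1 + \tfrac{3}{4}\left[\tfrac{x_1^2}{2}\right]_0^1 = \tfrac{1}{6} + \tfrac{3}{8} = \tfrac{4 + 9}{24} = \tfrac{13}{24},$$

$$\mu_2 = \iint x_2 f(x_1, x_2)\,dx_1 dx_2 = \int_0^1\!\!\int_0^1 x_2\left(\tfrac{1}{2}x_1 + \tfrac{3}{2}x_2\right)dx_1 dx_2 = \int_0^1 x_2\left(\tfrac{1}{4} + \tfrac{3}{2}x_2\right)dx_2 = \tfrac{1}{4}\left[\tfrac{x_2^2}{2}\right]_0^1 + \tfrac{3}{2}\left[\tfrac{x_2^3}{3}\right]_0^1 = \tfrac{1}{8} + \tfrac{1}{2} = \tfrac{1 + 4}{8} = \tfrac{5}{8}.$$

The elements of the covariance matrix are

$$\sigma_{X_1 X_1} = \mathrm{E}X_1^2 - \mu_1^2 \quad\text{with}\quad \mathrm{E}X_1^2 = \int_0^1\!\!\int_0^1 x_1^2\left(\tfrac{1}{2}x_1 + \tfrac{3}{2}x_2\right)dx_1 dx_2 = \tfrac{1}{2}\left[\tfrac{x_1^4}{4}\right]_0^1 + \tfrac{3}{4}\left[\tfrac{x_1^3}{3}\right]_0^1 = \tfrac{3}{8},$$

$$\sigma_{X_2 X_2} = \mathrm{E}X_2^2 - \mu_2^2 \quad\text{with}\quad \mathrm{E}X_2^2 = \int_0^1\!\!\int_0^1 x_2^2\left(\tfrac{1}{2}x_1 + \tfrac{3}{2}x_2\right)dx_1 dx_2 = \tfrac{1}{4}\left[\tfrac{x_2^3}{3}\right]_0^1 + \tfrac{3}{2}\left[\tfrac{x_2^4}{4}\right]_0^1 = \tfrac{11}{24},$$

$$\sigma_{X_1 X_2} = \mathrm{E}(X_1 X_2) - \mu_1\mu_2 \quad\text{with}\quad \mathrm{E}(X_1 X_2) = \int_0^1\!\!\int_0^1 x_1 x_2\left(\tfrac{1}{2}x_1 + \tfrac{3}{2}x_2\right)dx_1 dx_2 = \int_0^1 \left(\tfrac{1}{6}x_2 + \tfrac{3}{4}x_2^2\right)dx_2 = \tfrac{1}{6}\left[\tfrac{x_2^2}{2}\right]_0^1 + \tfrac{3}{4}\left[\tfrac{x_2^3}{3}\right]_0^1 = \tfrac{1}{3}.$$

Hence the covariance matrix is

$$\Sigma = \begin{pmatrix} 0.0816 & -0.0052 \\ -0.0052 & 0.0677 \end{pmatrix}.$$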

Conditional  Expectations 


The conditional expectations are

$$\mathrm{E}(X_2 \mid x_1) = \int x_2 f(x_2 \mid x_1)\,dx_2 \quad\text{and}\quad \mathrm{E}(X_1 \mid x_2) = \int x_1 f(x_1 \mid x_2)\,dx_1. \tag{4.27}$$

$\mathrm{E}(X_2 \mid x_1)$ represents the location parameter of the conditional pdf of $X_2$ given that $X_1 = x_1$. In the same way, we can define $\mathrm{Var}(X_2 \mid X_1 = x_1)$ as a measure of the dispersion of $X_2$ given that $X_1 = x_1$. We have from (4.20) that

$$\mathrm{Var}(X_2 \mid X_1 = x_1) = \mathrm{E}(X_2 X_2^\top \mid X_1 = x_1) - \mathrm{E}(X_2 \mid X_1 = x_1)\,\mathrm{E}(X_2^\top \mid X_1 = x_1).$$

Using the conditional covariance matrix, the conditional correlations may be defined as:

$$\rho_{X_2 X_3 \mid X_1 = x_1} = \frac{\mathrm{Cov}(X_2, X_3 \mid X_1 = x_1)}{\sqrt{\mathrm{Var}(X_2 \mid X_1 = x_1)\,\mathrm{Var}(X_3 \mid X_1 = x_1)}}.$$

These conditional correlations are known as partial correlations between $X_2$ and $X_3$, conditioned on $X_1$ being equal to $x_1$.

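Example 4.6 Consider the following pdf

$$f(x_1, x_2, x_3) = \tfrac{2}{3}(x_1 + x_2 + x_3) \quad\text{where } 0 < x_1, x_2, x_3 < 1.$$

Note that the pdf is symmetric in $x_1$, $x_2$ and $x_3$, which facilitates the computations. For instance,

$$f(x_1, x_2) = \tfrac{2}{3}\left(x_1 + x_2 + \tfrac{1}{2}\right), \quad 0 < x_1, x_2 < 1,$$
$$f(x_1) = \tfrac{2}{3}(x_1 + 1), \quad 0 < x_1 < 1,$$

and the other marginals are similar. We also have

$$f(x_1, x_2 \mid x_3) = \frac{x_1 + x_2 + x_3}{x_3 + 1}, \quad 0 < x_1, x_2 < 1,$$
$$f(x_1 \mid x_3) = \frac{x_1 + x_3 + \frac{1}{2}}{x_3 + 1}, \quad 0 < x_1 < 1.$$

It is easy to compute the following moments:

$$\mathrm{E}(X_i) = \tfrac{5}{9}; \quad \mathrm{E}(X_i^2) = \tfrac{7}{18}; \quad \mathrm{E}(X_i X_j) = \tfrac{11}{36} \quad (i \ne j \text{ and } i, j = 1, 2, 3),$$

$$\mathrm{E}(X_1 \mid X_3 = x_3) = \mathrm{E}(X_2 \mid X_3 = x_3) = \frac{1}{12}\left(\frac{6x_3 + 7}{x_3 + 1}\right);$$

$$\mathrm{E}(X_1^2 \mid X_3 = x_3) = \mathrm{E}(X_2^2 \mid X_3 = x_3) = \frac{1}{12}\left(\frac{4x_3 + 5}{x_3 + 1}\right)$$

and

$$\mathrm{E}(X_1 X_2 \mid X_3 = x_3) = \frac{1}{12}\left(\frac{3x_3 + 4}{x_3 + 1}\right).$$

Note that the conditional means of $X_1$ and of $X_2$, given $X_3 = x_3$, are not linear in $x_3$. From these moments we obtain:

$$\Sigma = \begin{pmatrix} \frac{13}{162} & -\frac{1}{324} & -\frac{1}{324} \\ -\frac{1}{324} & \frac{13}{162} & -\frac{1}{324} \\ -\frac{1}{324} & -\frac{1}{324} & \frac{13}{162} \end{pmatrix}, \quad\text{in particular } \rho_{X_1 X_2} = -\frac{1}{26} \approx -0.0385.$$

The conditional covariance matrix of $X_1$ and $X_2$, given $X_3 = x_3$, is

$$\mathrm{Var}\left\{\binom{X_1}{X_2} \,\Big|\, X_3 = x_3\right\} = \begin{pmatrix} \dfrac{12x_3^2 + 24x_3 + 11}{144(x_3 + 1)^2} & \dfrac{-1}{144(x_3 + 1)^2} \\ \dfrac{-1}{144(x_3 + 1)^2} & \dfrac{12x_3^2 + 24x_3 + 11}{144(x_3 + 1)^2} \end{pmatrix}.$$

In particular, the partial correlation between $X_1$ and $X_2$, given that $X_3$ is fixed at $x_3$, is given by $\rho_{X_1 X_2 \mid X_3 = x_3} = -\dfrac{1}{12x_3^2 + 24x_3 + 11}$, which ranges from $-0.0909$ to $-0.0213$ when $x_3$ goes from 0 to 1. Therefore, in this example, the partial correlation may be larger or smaller than the simple correlation, depending on the value of the condition $X_3 = x_3$.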
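The partial correlation of Example 4.6 can be recomputed from the conditional moments. This is a small sketch in exact arithmetic; the function names are hypothetical, and the conditional moments are obtained by integrating $(x_1 + x_2 + x_3)/(x_3 + 1)$ over the unit square.

```python
from fractions import Fraction


def conditional_moments(x3):
    """Conditional mean, variance and covariance of (X1, X2) given X3 = x3."""
    x3 = Fraction(x3)
    norm = x3 + 1                                  # denominator of f(x1, x2 | x3)
    mean = (Fraction(7, 12) + x3 / 2) / norm       # E(X1 | x3) = E(X2 | x3)
    second = (Fraction(5, 12) + x3 / 3) / norm     # E(X1^2 | x3) = E(X2^2 | x3)
    cross = (Fraction(1, 3) + x3 / 4) / norm       # E(X1 X2 | x3)
    return mean, second - mean ** 2, cross - mean ** 2


def partial_correlation(x3):
    """Partial correlation of X1 and X2 given X3 = x3."""
    _, var, cov = conditional_moments(x3)
    return cov / var   # both conditional variances coincide by symmetry


print(float(partial_correlation(0)), float(partial_correlation(1)))  # about -0.0909 and -0.0213
print(float(Fraction(-1, 26)))                                       # simple correlation, about -0.0385
```

The simple correlation lies between the two endpoint values, which matches the closing remark of the example.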
Example 4.7 Consider the following joint pdf

$$f(x_1, x_2, x_3) = 2x_2(x_1 + x_3); \quad 0 < x_1, x_2, x_3 < 1.$$

Note the symmetry of $x_1$ and $x_3$ in the pdf and that $X_2$ is independent of $(X_1, X_3)$. It immediately follows that

$$f(x_1, x_3) = (x_1 + x_3), \quad 0 < x_1, x_3 < 1,$$
$$f(x_1) = x_1 + \tfrac{1}{2}; \qquad f(x_2) = 2x_2; \qquad f(x_3) = x_3 + \tfrac{1}{2}.$$

Simple computations lead to

$$\mathrm{E}(X) = \left(\tfrac{7}{12}, \tfrac{2}{3}, \tfrac{7}{12}\right)^\top.$$

Let us analyse the conditional distribution of $(X_1, X_2)$ given $X_3 = x_3$. We have

$$f(x_1, x_2 \mid x_3) = \frac{4(x_1 + x_3)x_2}{2x_3 + 1}, \quad 0 < x_1, x_2 < 1,$$
$$f(x_1 \mid x_3) = 2\left(\frac{x_1 + x_3}{2x_3 + 1}\right), \quad 0 < x_1 < 1,$$
$$f(x_2 \mid x_3) = f(x_2) = 2x_2, \quad 0 < x_2 < 1,$$
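so that again $X_1$ and $X_2$ are independent conditional on $X_3 = x_3$. In this case

$$\mathrm{E}\left\{\binom{X_1}{X_2} \,\Big|\, X_3 = x_3\right\} = \begin{pmatrix} \dfrac{1}{3}\,\dfrac{2 + 3x_3}{1 + 2x_3} \\ \dfrac{2}{3} \end{pmatrix}, \qquad \mathrm{Var}\left\{\binom{X_1}{X_2} \,\Big|\, X_3 = x_3\right\} = \begin{pmatrix} \dfrac{1}{18}\,\dfrac{6x_3^2 + 6x_3 + 1}{(2x_3 + 1)^2} & 0 \\ 0 & \dfrac{1}{18} \end{pmatrix}.$$

Properties of Conditional Expectations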

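Since $\mathrm{E}(X_2 \mid X_1 = x_1)$ is a function of $x_1$, say $h(x_1)$, we can define the random variable $h(X_1) = \mathrm{E}(X_2 \mid X_1)$. The same can be done when defining the random variable $\mathrm{Var}(X_2 \mid X_1)$. These two random variables share some interesting properties:

$$\mathrm{E}(X_2) = \mathrm{E}\{\mathrm{E}(X_2 \mid X_1)\} \tag{4.28}$$

$$\mathrm{Var}(X_2) = \mathrm{E}\{\mathrm{Var}(X_2 \mid X_1)\} + \mathrm{Var}\{\mathrm{E}(X_2 \mid X_1)\}. \tag{4.29}$$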

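Example 4.8 Consider the following pdf

$$f(x_1, x_2) = 2e^{-x_2/x_1}; \quad 0 < x_1 < 1, \; x_2 > 0.$$

It is easy to show that

$$f(x_1) = 2x_1 \text{ for } 0 < x_1 < 1; \quad \mathrm{E}(X_1) = \tfrac{2}{3} \text{ and } \mathrm{Var}(X_1) = \tfrac{1}{18},$$

$$f(x_2 \mid x_1) = \frac{1}{x_1}e^{-x_2/x_1} \text{ for } x_2 > 0; \quad \mathrm{E}(X_2 \mid X_1) = X_1 \text{ and } \mathrm{Var}(X_2 \mid X_1) = X_1^2.$$

Without explicitly computing $f(x_2)$, we can obtain:

$$\mathrm{E}(X_2) = \mathrm{E}\{\mathrm{E}(X_2 \mid X_1)\} = \mathrm{E}(X_1) = \tfrac{2}{3},$$

$$\mathrm{Var}(X_2) = \mathrm{E}\{\mathrm{Var}(X_2 \mid X_1)\} + \mathrm{Var}\{\mathrm{E}(X_2 \mid X_1)\} = \mathrm{E}(X_1^2) + \mathrm{Var}(X_1) = \tfrac{2}{4} + \tfrac{1}{18} = \tfrac{10}{18}.$$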
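The two values obtained in Example 4.8 from (4.28) and (4.29) can be cross-checked by simulation. The sketch below assumes inverse-cdf sampling for $X_1$ (since $F(x_1) = x_1^2$) and an exponential draw with mean $x_1$ for $X_2 \mid X_1 = x_1$; the seed and sample size are arbitrary choices.

```python
import random

random.seed(0)
n = 200_000
x2_sample = []
for _ in range(n):
    # F(x1) = x1^2 on (0, 1), so X1 = sqrt(U); 1 - random() avoids an exact zero.
    x1 = (1.0 - random.random()) ** 0.5
    # X2 | X1 = x1 has density exp(-x2 / x1) / x1, i.e. exponential with rate 1 / x1.
    x2_sample.append(random.expovariate(1.0 / x1))

mean_x2 = sum(x2_sample) / n
var_x2 = sum((x - mean_x2) ** 2 for x in x2_sample) / n
print(mean_x2, 2 / 3)      # compare with E(X2) = 2/3
print(var_x2, 10 / 18)     # compare with Var(X2) = 10/18
```

The agreement is only approximate, as expected from a Monte Carlo estimate, but it confirms the result without ever computing $f(x_2)$.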

The conditional expectation $\mathrm{E}(X_2 \mid X_1)$ viewed as a function $h(X_1)$ of $X_1$ (known as the regression function of $X_2$ on $X_1$), can be interpreted as a conditional approximation of $X_2$ by a function of $X_1$. The error term of the approximation is then given by:

$$U = X_2 - \mathrm{E}(X_2 \mid X_1).$$

Theorem 4.3 Let $X_1 \in \mathbb{R}^k$ and $X_2 \in \mathbb{R}^{p-k}$ and $U = X_2 - \mathrm{E}(X_2 \mid X_1)$. Then we have:

1.  $\mathrm{E}(U) = 0$

2.  $\mathrm{E}(X_2 \mid X_1)$ is the best approximation of $X_2$ by a function $h(X_1)$ of $X_1$ where $h : \mathbb{R}^k \to \mathbb{R}^{p-k}$. "Best" is meant in the minimum mean squared error (MSE) sense, where

$$\mathrm{MSE}(h) = \mathrm{E}[\{X_2 - h(X_1)\}^\top \{X_2 - h(X_1)\}].$$

Characteristic Functions

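The characteristic function (cf) of a random vector $X \in \mathbb{R}^p$ (respectively its density $f(x)$) is defined as

$$\varphi_X(t) = \mathrm{E}(e^{\mathrm{i}t^\top X}) = \int e^{\mathrm{i}t^\top x} f(x)\,dx, \quad t \in \mathbb{R}^p,$$

where $\mathrm{i}$ is the complex unit: $\mathrm{i}^2 = -1$. The cf has the following properties:

$$\varphi_X(0) = 1 \text{ and } |\varphi_X(t)| \le 1. \tag{4.30}$$

If $\varphi$ is absolutely integrable, i.e. the integral $\int_{-\infty}^{\infty} |\varphi(x)|\,dx$ exists and is finite, then

$$f(x) = \frac{1}{(2\pi)^p} \int_{-\infty}^{\infty} e^{-\mathrm{i}t^\top x}\,\varphi_X(t)\,dt. \tag{4.31}$$

If $X = (X_1, X_2, \ldots, X_p)^\top$, then for $t = (t_1, t_2, \ldots, t_p)^\top$

$$\varphi_{X_1}(t_1) = \varphi_X(t_1, 0, \ldots, 0), \quad \ldots, \quad \varphi_{X_p}(t_p) = \varphi_X(0, \ldots, 0, t_p). \tag{4.32}$$

If $X_1, \ldots, X_p$ are independent random variables, then for $t = (t_1, t_2, \ldots, t_p)^\top$

$$\varphi_X(t) = \varphi_{X_1}(t_1) \cdot \ldots \cdot \varphi_{X_p}(t_p). \tag{4.33}$$

If $X_1, \ldots, X_p$ are independent random variables, then for $t \in \mathbb{R}$

$$\varphi_{X_1 + \cdots + X_p}(t) = \varphi_{X_1}(t) \cdot \ldots \cdot \varphi_{X_p}(t). \tag{4.34}$$

The characteristic function can recover all the cross-product moments of any order: for all $j_k \ge 0$, $k = 1, \ldots, p$ and for $t = (t_1, \ldots, t_p)^\top$ we have

$$\mathrm{E}\left(X_1^{j_1} \cdot \ldots \cdot X_p^{j_p}\right) = \frac{1}{\mathrm{i}^{j_1 + \cdots + j_p}} \left[\frac{\partial^{\,j_1 + \cdots + j_p}\,\varphi_X(t)}{\partial t_1^{j_1} \cdots \partial t_p^{j_p}}\right]_{t=0}. \tag{4.35}$$

Example 4.9 The cf of the density in Example 4.5 is given by

$$\varphi_X(t) = \int_0^1\!\!\int_0^1 e^{\mathrm{i}t^\top x} f(x)\,dx = \int_0^1\!\!\int_0^1 \{\cos(t_1 x_1 + t_2 x_2) + \mathrm{i}\sin(t_1 x_1 + t_2 x_2)\}\left(\tfrac{1}{2}x_1 + \tfrac{3}{2}x_2\right)dx_1 dx_2$$

$$= \frac{0.5\,e^{\mathrm{i}t_1}\left(3\mathrm{i}t_1 - 3\mathrm{i}e^{\mathrm{i}t_2}t_1 + \mathrm{i}t_2 - \mathrm{i}e^{\mathrm{i}t_2}t_2 + t_1 t_2 - 4e^{\mathrm{i}t_2}t_1 t_2\right)}{t_1^2 t_2^2} - \frac{0.5\left(3\mathrm{i}t_1 - 3\mathrm{i}e^{\mathrm{i}t_2}t_1 + \mathrm{i}t_2 - \mathrm{i}e^{\mathrm{i}t_2}t_2 - 3e^{\mathrm{i}t_2}t_1 t_2\right)}{t_1^2 t_2^2}.$$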

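Example 4.10 Suppose $X \in \mathbb{R}^1$ follows the density of the standard normal distribution

$$f_X(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)$$

(see Sect. 4.4), then the cf can be computed via

$$\varphi_X(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{\mathrm{i}tx} \exp\left(-\frac{x^2}{2}\right)dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left\{-\frac{1}{2}(x^2 - 2\mathrm{i}tx + \mathrm{i}^2 t^2)\right\} \exp\left\{\frac{1}{2}\mathrm{i}^2 t^2\right\}dx$$

$$= \exp\left(-\frac{t^2}{2}\right) \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{(x - \mathrm{i}t)^2}{2}\right\}dx = \exp\left(-\frac{t^2}{2}\right),$$

since $\mathrm{i}^2 = -1$ and $\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{(x - \mathrm{i}t)^2}{2}\right\}dx = 1$.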
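The closed form $\exp(-t^2/2)$ of Example 4.10 can be compared with a direct numerical evaluation of $\mathrm{E}(e^{\mathrm{i}tX})$. The following is a rough sketch, assuming that truncating the integral to $[-10, 10]$ and using the midpoint rule is sufficient; `normal_cf_numeric` is a name invented for this illustration.

```python
import cmath
import math


def normal_cf_numeric(t, lower=-10.0, upper=10.0, n=4000):
    """Midpoint-rule approximation of E(exp(i t X)) for X ~ N(0, 1)."""
    h = (upper - lower) / n
    total = 0j
    for k in range(n):
        x = lower + (k + 0.5) * h
        density = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        total += cmath.exp(1j * t * x) * density
    return total * h


for t in (0.0, 0.5, 1.0, 2.0):
    # numerical cf (complex) next to the closed form exp(-t^2 / 2)
    print(t, normal_cf_numeric(t), math.exp(-0.5 * t * t))
```

The imaginary part is numerically zero because the standard normal density is symmetric around the origin.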


A variety of distributional characteristics can be computed from $\varphi_X(t)$. The standard normal distribution has a very simple cf, as was seen in Example 4.10. Deviations from normal covariance structures can be measured by the deviations from the cf (or characteristics of it). In Table 4.1 we give an overview of the cf's for a variety of distributions.

Theorem 4.4 (Cramér-Wold) The distribution of $X \in \mathbb{R}^p$ is completely determined by the set of all (one-dimensional) distributions of $t^\top X$ where $t \in \mathbb{R}^p$.

Table 4.1 Characteristic functions for some common distributions

| Distribution | pdf | cf |
|---|---|---|
| Uniform | $f(x) = I(x \in [a, b])/(b - a)$ | $\varphi_X(t) = (e^{\mathrm{i}bt} - e^{\mathrm{i}at})/\{(b - a)\mathrm{i}t\}$ |
| $N_1(\mu, \sigma^2)$ | $f(x) = (2\pi\sigma^2)^{-1/2}\exp\{-(x - \mu)^2/2\sigma^2\}$ | $\varphi_X(t) = e^{\mathrm{i}\mu t - \sigma^2 t^2/2}$ |
| $\chi^2(n)$ | $f(x) = I(x > 0)\,x^{n/2 - 1}e^{-x/2}/\{\Gamma(n/2)2^{n/2}\}$ | $\varphi_X(t) = (1 - 2\mathrm{i}t)^{-n/2}$ |
| $N_p(\mu, \Sigma)$ | $f(x) = \lvert 2\pi\Sigma\rvert^{-1/2}\exp\{-(x - \mu)^\top\Sigma^{-1}(x - \mu)/2\}$ | $\varphi_X(t) = e^{\mathrm{i}t^\top\mu - t^\top\Sigma t/2}$ |

This theorem says that we can determine the distribution of $X$ in $\mathbb{R}^p$ by specifying all of the one-dimensional distributions of the linear combinations

$$\sum_{j=1}^{p} t_j X_j = t^\top X, \quad t = (t_1, t_2, \ldots, t_p)^\top.$$


Cumulant  Functions 


Moments $m_k = \int x^k f(x)\,dx$ often help in describing distributional characteristics. The normal distribution in $d = 1$ dimension is completely characterised by its standard normal density $f = \varphi$ and the moment parameters are $\mu = m_1$ and $\sigma^2 = m_2 - m_1^2$. Another helpful class of parameters are the cumulants or semi-invariants of a distribution. In order to simplify notation we concentrate here on the one-dimensional ($d = 1$) case.

For a given one-dimensional random variable $X$ with density $f$ and finite moments of order $k$ the characteristic function $\varphi_X(t) = \mathrm{E}(e^{\mathrm{i}tX})$ has the derivative

$$\frac{1}{\mathrm{i}^j}\left[\frac{\partial^j \log\{\varphi_X(t)\}}{\partial t^j}\right]_{t=0} = \kappa_j, \quad j = 1, \ldots, k.$$

The values $\kappa_j$ are called cumulants or semi-invariants since $\kappa_j$ does not change (for $j > 1$) under a shift transformation $X \mapsto X + a$. The cumulants are natural parameters for dimension reduction methods, in particular the Projection Pursuit method (see Sect. 20.2).

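The relationship between the first $k$ moments $m_1, \ldots, m_k$ and the cumulants is given by

$$\kappa_k = (-1)^{k-1} \begin{vmatrix} m_1 & 1 & \cdots & 0 \\ m_2 & \binom{1}{0}m_1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ m_k & \binom{k-1}{0}m_{k-1} & \cdots & \binom{k-1}{k-2}m_1 \end{vmatrix}. \tag{4.36}$$

Example 4.11 Suppose that $k = 1$, then formula (4.36) above yields

$$\kappa_1 = m_1.$$

For $k = 2$ we obtain

$$\kappa_2 = -\begin{vmatrix} m_1 & 1 \\ m_2 & \binom{1}{0}m_1 \end{vmatrix} = m_2 - m_1^2.$$

For $k = 3$ we have to calculate

$$\kappa_3 = \begin{vmatrix} m_1 & 1 & 0 \\ m_2 & m_1 & 1 \\ m_3 & m_2 & 2m_1 \end{vmatrix}.$$

Calculating the determinant we have:

$$\kappa_3 = m_1 \begin{vmatrix} m_1 & 1 \\ m_2 & 2m_1 \end{vmatrix} - m_2 \begin{vmatrix} 1 & 0 \\ m_2 & 2m_1 \end{vmatrix} + m_3 \begin{vmatrix} 1 & 0 \\ m_1 & 1 \end{vmatrix} = m_1(2m_1^2 - m_2) - m_2(2m_1) + m_3 = m_3 - 3m_1 m_2 + 2m_1^3. \tag{4.37}$$

Similarly one calculates

$$\kappa_4 = m_4 - 4m_3 m_1 - 3m_2^2 + 12m_2 m_1^2 - 6m_1^4. \tag{4.38}$$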


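The same type of process is used to find the moments from the cumulants:

$$\begin{aligned} m_1 &= \kappa_1 \\ m_2 &= \kappa_2 + \kappa_1^2 \\ m_3 &= \kappa_3 + 3\kappa_2\kappa_1 + \kappa_1^3 \\ m_4 &= \kappa_4 + 4\kappa_3\kappa_1 + 3\kappa_2^2 + 6\kappa_2\kappa_1^2 + \kappa_1^4. \end{aligned} \tag{4.39}$$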
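The maps (4.37), (4.38) and (4.39) between raw moments and cumulants are short enough to code directly. The sketch below is illustrative only; the function names are made up, and the test case uses the raw moments of a $N(1, 4)$ variable, whose cumulants of order three and four vanish.

```python
def cumulants_from_moments(m1, m2, m3, m4):
    """First four cumulants from raw moments, following (4.37) and (4.38)."""
    k1 = m1
    k2 = m2 - m1 ** 2
    k3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3
    k4 = m4 - 4 * m3 * m1 - 3 * m2 ** 2 + 12 * m2 * m1 ** 2 - 6 * m1 ** 4
    return k1, k2, k3, k4


def moments_from_cumulants(k1, k2, k3, k4):
    """Inverse map (4.39): raw moments from the first four cumulants."""
    m1 = k1
    m2 = k2 + k1 ** 2
    m3 = k3 + 3 * k2 * k1 + k1 ** 3
    m4 = k4 + 4 * k3 * k1 + 3 * k2 ** 2 + 6 * k2 * k1 ** 2 + k1 ** 4
    return m1, m2, m3, m4


# Raw moments of N(1, 4): m1 = 1, m2 = 5, m3 = 13, m4 = 73.
print(cumulants_from_moments(1, 5, 13, 73))    # → (1, 4, 0, 0)
print(moments_from_cumulants(1, 4, 0, 0))      # → (1, 5, 13, 73)
```

The round trip returns the original moments, and the vanishing third and fourth cumulants reflect that a normal distribution is fully described by its mean and variance.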

A very simple relationship can be observed between the semi-invariants and the central moments $\mu_k = \mathrm{E}(X - \mu)^k$, where $\mu = m_1$ as defined before. In fact, $\kappa_2 = \mu_2$, $\kappa_3 = \mu_3$ and $\kappa_4 = \mu_4 - 3\mu_2^2$.

Skewness $\gamma_3$ and kurtosis $\gamma_4$ are defined as:

$$\gamma_3 = \mathrm{E}(X - \mu)^3/\sigma^3, \qquad \gamma_4 = \mathrm{E}(X - \mu)^4/\sigma^4. \tag{4.40}$$

The skewness and kurtosis determine the shape of one-dimensional distributions. The skewness of a normal distribution is 0 and the kurtosis equals 3. The relation of these parameters to the cumulants is given by:

$$\gamma_3 = \frac{\kappa_3}{\kappa_2^{3/2}}. \tag{4.41}$$

From (4.39) and Example 4.11

$$\gamma_4 = \frac{\kappa_4 + 3\kappa_2^2 + \kappa_1^4 - m_1^4}{\sigma^4} = \frac{\kappa_4 + 3\kappa_2^2}{\kappa_2^2} = \frac{\kappa_4}{\kappa_2^2} + 3. \tag{4.42}$$

These relations will be used later in Sect. 20.2 on Projection Pursuit to determine deviations from normality.


Summary

•  The expectation of a random vector $X$ is $\mu = \int x f(x)\,dx$, the covariance matrix $\Sigma = \mathrm{Var}(X) = \mathrm{E}(X - \mu)(X - \mu)^\top$. We denote $X \sim (\mu, \Sigma)$.

•  Expectations are linear, i.e. $\mathrm{E}(\alpha X + \beta Y) = \alpha\,\mathrm{E}X + \beta\,\mathrm{E}Y$. If $X$ and $Y$ are independent, then $\mathrm{E}(XY^\top) = \mathrm{E}X\,\mathrm{E}Y^\top$.

•  The covariance between two random vectors $X$ and $Y$ is $\Sigma_{XY} = \mathrm{Cov}(X, Y) = \mathrm{E}(X - \mathrm{E}X)(Y - \mathrm{E}Y)^\top = \mathrm{E}(XY^\top) - \mathrm{E}X\,\mathrm{E}Y^\top$. If $X$ and $Y$ are independent, then $\mathrm{Cov}(X, Y) = 0$.

•  The characteristic function (cf) of a random vector $X$ is $\varphi_X(t) = \mathrm{E}(e^{\mathrm{i}t^\top X})$.

•  The distribution of a $p$-dimensional random variable $X$ is completely determined by all one-dimensional distributions of $t^\top X$ where $t \in \mathbb{R}^p$ (Theorem of Cramér-Wold).

•  The conditional expectation $\mathrm{E}(X_2 \mid X_1)$ is the MSE best approximation of $X_2$ by a function of $X_1$.


4.3  Transformations


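Suppose that $X$ has pdf $f_X(x)$. What is the pdf of $Y = 3X$? Or if $X = (X_1, X_2, X_3)^\top$, what is the pdf of

$$Y = \begin{pmatrix} 3X_1 \\ X_1 - 4X_2 \\ X_3 \end{pmatrix}?$$

This is a special case of asking for the pdf of $Y$ when

$$X = u(Y) \tag{4.43}$$

for a one-to-one transformation $u: \mathbb{R}^p \to \mathbb{R}^p$. Define the Jacobian of $u$ as

$$\mathcal{J} = \left(\frac{\partial x_i}{\partial y_j}\right) = \left(\frac{\partial u_i(y)}{\partial y_j}\right)$$

and let $\mathrm{abs}(|\mathcal{J}|)$ be the absolute value of the determinant of this Jacobian. The pdf of $Y$ is given by

$$f_Y(y) = \mathrm{abs}(|\mathcal{J}|) \cdot f_X\{u(y)\}. \tag{4.44}$$

Using this we can answer the introductory questions, namely

$$(x_1, \ldots, x_p)^\top = u(y_1, \ldots, y_p) = \tfrac{1}{3}(y_1, \ldots, y_p)^\top$$

with

$$\mathcal{J} = \begin{pmatrix} \frac{1}{3} & & 0 \\ & \ddots & \\ 0 & & \frac{1}{3} \end{pmatrix}$$

and hence $\mathrm{abs}(|\mathcal{J}|) = \left(\frac{1}{3}\right)^p$. So the pdf of $Y$ is $\frac{1}{3^p} f_X\left(\frac{y}{3}\right)$.

This introductory example is a special case of

$$Y = \mathcal{A}X + b, \quad\text{where } \mathcal{A} \text{ is nonsingular.}$$

The inverse transformation is

$$X = \mathcal{A}^{-1}(Y - b).$$

Therefore

$$\mathcal{J} = \mathcal{A}^{-1},$$

and hence

$$f_Y(y) = \mathrm{abs}(|\mathcal{A}|^{-1})\, f_X\{\mathcal{A}^{-1}(y - b)\}. \tag{4.45}$$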


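Example 4.12 Consider $X = (X_1, X_2)^\top \in \mathbb{R}^2$ with density $f_X(x) = f_X(x_1, x_2)$,

$$\mathcal{A} = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \qquad b = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.$$

Then

$$Y = \mathcal{A}X + b = \begin{pmatrix} X_1 + X_2 \\ X_1 - X_2 \end{pmatrix}$$

and

$$|\mathcal{A}| = -2, \quad \mathrm{abs}(|\mathcal{A}|^{-1}) = \frac{1}{2}, \quad \mathcal{A}^{-1} = -\frac{1}{2}\begin{pmatrix} -1 & -1 \\ -1 & 1 \end{pmatrix}.$$

Hence

$$f_Y(y) = \mathrm{abs}(|\mathcal{A}|^{-1}) \cdot f_X(\mathcal{A}^{-1}y) = \frac{1}{2} f_X\left\{\frac{1}{2}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}\begin{pmatrix} y_1 \\ y_2 \end{pmatrix}\right\} = \frac{1}{2} f_X\left\{\frac{1}{2}(y_1 + y_2), \frac{1}{2}(y_1 - y_2)\right\}. \tag{4.46}$$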


Example 4.13 Consider $X \in \mathbb{R}^1$ with density $f_X(x)$ and $Y = \exp(X)$. According to (4.43) $x = u(y) = \log(y)$ and hence the Jacobian is

$$\mathcal{J} = \frac{dx}{dy} = \frac{1}{y}.$$

The pdf of $Y$ is therefore:

$$f_Y(y) = \frac{1}{y} f_X\{\log(y)\}.$$
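Example 4.13 can be made concrete by taking $X$ to be standard normal, which is an assumption made only for this sketch (the example itself leaves $f_X$ general). The transformed density should then satisfy $P(Y \le 1) = P(X \le 0) = 0.5$, which the code checks with a midpoint rule on $(0, 1]$.

```python
import math


def normal_pdf(x):
    """Standard normal density, used here as an illustrative choice of f_X."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def pdf_of_exp(y, f_x=normal_pdf):
    """Density of Y = exp(X): abs(dx/dy) * f_X(log y) with dx/dy = 1 / y."""
    return f_x(math.log(y)) / y


# Integrate f_Y over (0, 1] with the midpoint rule.
n = 20000
h = 1.0 / n
prob = sum(pdf_of_exp((k + 0.5) * h) for k in range(n)) * h
print(prob)   # approximates P(Y <= 1) = P(X <= 0) = 0.5
```

Passing another density as `f_x` gives the pdf of $\exp(X)$ for that choice of $X$ without changing the Jacobian factor $1/y$.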


Summary

•  If $X$ has pdf $f_X(x)$, then a transformed random vector $Y$, i.e. $X = u(Y)$, has pdf $f_Y(y) = \mathrm{abs}(|\mathcal{J}|) \cdot f_X\{u(y)\}$, where $\mathcal{J}$ denotes the Jacobian $\mathcal{J} = \left(\frac{\partial u_i(y)}{\partial y_j}\right)$.

•  In the case of a linear relation $Y = \mathcal{A}X + b$ the pdf's of $X$ and $Y$ are related via $f_Y(y) = \mathrm{abs}(|\mathcal{A}|^{-1})\, f_X\{\mathcal{A}^{-1}(y - b)\}$.

4.4  The  Multinormal  Distribution 

The multinormal distribution with mean $\mu$ and covariance $\Sigma > 0$ has the density

$$f(x) = |2\pi\Sigma|^{-1/2} \exp\left\{-\frac{1}{2}(x - \mu)^\top \Sigma^{-1}(x - \mu)\right\}. \tag{4.47}$$

We write $X \sim N_p(\mu, \Sigma)$.

How is this multinormal distribution with mean $\mu$ and covariance $\Sigma$ related to the multivariate standard normal $N_p(0, \mathcal{I}_p)$? Through a linear transformation using the results of Sect. 4.3, as shown in the next theorem.

Theorem 4.5 Let $X \sim N_p(\mu, \Sigma)$ and $Y = \Sigma^{-1/2}(X - \mu)$ (Mahalanobis transformation). Then

$$Y \sim N_p(0, \mathcal{I}_p),$$

i.e. the elements $Y_j \in \mathbb{R}$ are independent, one-dimensional $N(0, 1)$ variables.

Proof Note that $(X - \mu)^\top \Sigma^{-1}(X - \mu) = Y^\top Y$. Application of (4.45) gives $\mathcal{J} = \Sigma^{1/2}$, hence

$$f_Y(y) = (2\pi)^{-p/2} \exp\left(-\frac{1}{2}y^\top y\right), \tag{4.48}$$

which is by (4.47) the pdf of a $N_p(0, \mathcal{I}_p)$.


□

Note that the above Mahalanobis transformation yields in fact a random variable $Y = (Y_1, \ldots, Y_p)^\top$ composed of independent one-dimensional $Y_j \sim N_1(0, 1)$ since

$$f_Y(y) = \frac{1}{(2\pi)^{p/2}} \exp\left(-\frac{1}{2}y^\top y\right) = \prod_{j=1}^{p} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}y_j^2\right) = \prod_{j=1}^{p} f_{Y_j}(y_j).$$

Here each $f_{Y_j}(y)$ is a standard normal density $\frac{1}{\sqrt{2\pi}} \exp\left(-\frac{y^2}{2}\right)$. From this it is clear that $\mathrm{E}(Y) = 0$ and $\mathrm{Var}(Y) = \mathcal{I}_p$.

How can we create $N_p(\mu, \Sigma)$ variables on the basis of $N_p(0, \mathcal{I}_p)$ variables? We use the inverse linear transformation

$$X = \Sigma^{1/2}Y + \mu. \tag{4.49}$$

Using (4.11) and (4.23) we can also check that $\mathrm{E}(X) = \mu$ and $\mathrm{Var}(X) = \Sigma$. The following theorem is useful because it presents the distribution of a variable after it has been linearly transformed. The proof is left as an exercise.

Theorem 4.6 Let $X \sim N_p(\mu, \Sigma)$ and $\mathcal{A}(p \times p)$, $c \in \mathbb{R}^p$, where $\mathcal{A}$ is nonsingular. Then $Y = \mathcal{A}X + c$ is again a $p$-variate Normal, i.e.

$$Y \sim N_p(\mathcal{A}\mu + c, \mathcal{A}\Sigma\mathcal{A}^\top). \tag{4.50}$$
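The construction (4.49) can be sketched for $p = 2$ with the standard library alone. Note one deliberate substitution: instead of the symmetric root $\Sigma^{1/2}$, the code uses a lower-triangular Cholesky factor $L$ with $LL^\top = \Sigma$, which by (4.23) gives the same covariance. The values of `mu` and `sigma` are illustrative assumptions.

```python
import math
import random


def cholesky_2x2(sigma):
    """Lower-triangular L with L L^T = sigma for a 2 x 2 positive definite matrix."""
    (a, b), (_, c) = sigma
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(c - l21 * l21)
    return l11, l21, l22


def sample_normal_2d(mu, sigma, n, rng):
    """Draw n points X = L Y + mu with Y ~ N_2(0, I_2)."""
    l11, l21, l22 = cholesky_2x2(sigma)
    draws = []
    for _ in range(n):
        y1, y2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        draws.append((mu[0] + l11 * y1, mu[1] + l21 * y1 + l22 * y2))
    return draws


rng = random.Random(1)
mu, sigma = (3.0, 2.0), ((1.0, -1.5), (-1.5, 4.0))
xs = sample_normal_2d(mu, sigma, 100_000, rng)
mean1 = sum(x[0] for x in xs) / len(xs)
mean2 = sum(x[1] for x in xs) / len(xs)
cov12 = sum((x[0] - mean1) * (x[1] - mean2) for x in xs) / len(xs)
print(mean1, mean2, cov12)   # should be close to 3, 2 and -1.5
```

Any matrix $\mathcal{A}$ with $\mathcal{A}\mathcal{A}^\top = \Sigma$ would serve here, which is a direct consequence of Theorem 4.6 applied to $Y \sim N_p(0, \mathcal{I}_p)$.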


Geometry of the $N_p(\mu, \Sigma)$ Distribution

From (4.47) we see that the density of the $N_p(\mu, \Sigma)$ distribution is constant on ellipsoids of the form

$$(x - \mu)^\top \Sigma^{-1}(x - \mu) = d^2. \tag{4.51}$$

Example 4.14 Figure 4.3 shows the contour ellipses of a two-dimensional normal distribution. Note that these contour ellipses are the iso-distance curves (2.34) from the mean of this normal distribution corresponding to the metric $\Sigma^{-1}$.

Fig. 4.3 Scatterplot of a normal sample (left panel: "Normal sample") and contour ellipses (right panel: "Contour Ellipses") for $\mu = \binom{3}{2}$ and $\Sigma = \begin{pmatrix} 1 & -1.5 \\ -1.5 & 4 \end{pmatrix}$. Q MVAcontnorm

According to Theorem 2.7 in Sect. 2.6 the half-lengths of the axes in the contour ellipsoid are $\sqrt{d^2\lambda_i}$ where $\lambda_i$ are the eigenvalues of $\Sigma$. If $\Sigma$ is a diagonal matrix, the rectangle circumscribing the contour ellipse has sides with length $2d\sigma_i$ and is thus naturally proportional to the standard deviations of $X_i$ ($i = 1, 2$).

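The distribution of the quadratic form in (4.51) is given in the next theorem.

Theorem 4.7 If $X \sim N_p(\mu, \Sigma)$, then the variable $U = (X - \mu)^\top \Sigma^{-1}(X - \mu)$ has a $\chi^2_p$ distribution.

Theorem 4.8 The characteristic function (cf) of a multinormal $N_p(\mu, \Sigma)$ is given by

$$\varphi_X(t) = \exp\left(\mathrm{i}\,t^\top\mu - \frac{1}{2}t^\top\Sigma t\right). \tag{4.52}$$

We can check Theorem 4.8 by transforming the cf back:

$$f(x) = \frac{1}{(2\pi)^p} \int \exp\left(-\mathrm{i}t^\top x + \mathrm{i}t^\top\mu - \frac{1}{2}t^\top\Sigma t\right)dt$$

$$= \frac{1}{|2\pi\Sigma^{-1}|^{1/2}|2\pi\Sigma|^{1/2}} \int \exp\left[-\frac{1}{2}\{t^\top\Sigma t + 2\mathrm{i}t^\top(x - \mu) - (x - \mu)^\top\Sigma^{-1}(x - \mu)\}\right] \exp\left[-\frac{1}{2}\{(x - \mu)^\top\Sigma^{-1}(x - \mu)\}\right]dt$$

$$= \frac{1}{|2\pi\Sigma|^{1/2}} \exp\left[-\frac{1}{2}\{(x - \mu)^\top\Sigma^{-1}(x - \mu)\}\right]$$

since

$$\int \frac{1}{|2\pi\Sigma^{-1}|^{1/2}} \exp\left[-\frac{1}{2}\{t^\top\Sigma t + 2\mathrm{i}t^\top(x - \mu) - (x - \mu)^\top\Sigma^{-1}(x - \mu)\}\right]dt$$

$$= \int \frac{1}{|2\pi\Sigma^{-1}|^{1/2}} \exp\left[-\frac{1}{2}\{(t + \mathrm{i}\Sigma^{-1}(x - \mu))^\top\Sigma(t + \mathrm{i}\Sigma^{-1}(x - \mu))\}\right]dt = 1.$$

Note that if $Y \sim N_p(0, \mathcal{I}_p)$, then

$$\varphi_Y(t) = \exp\left(-\frac{1}{2}t^\top\mathcal{I}_p t\right) = \exp\left(-\frac{1}{2}\sum_{i=1}^{p} t_i^2\right) = \varphi_{Y_1}(t_1) \cdot \ldots \cdot \varphi_{Y_p}(t_p),$$

which is consistent with (4.33).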


Singular Normal Distribution

Suppose that we have $\mathrm{rank}(\Sigma) = k < p$, where $p$ is the dimension of $X$. We define the (singular) density of $X$ with the aid of the G-Inverse $\Sigma^-$ of $\Sigma$,

$$f(x) = \frac{(2\pi)^{-k/2}}{(\lambda_1 \cdots \lambda_k)^{1/2}} \exp\left\{-\frac{1}{2}(x - \mu)^\top\Sigma^-(x - \mu)\right\}, \tag{4.53}$$

where

1.  $x$ lies on the hyperplane $\mathcal{N}^\top(x - \mu) = 0$ with $\mathcal{N}(p \times (p - k))$: $\mathcal{N}^\top\Sigma = 0$ and $\mathcal{N}^\top\mathcal{N} = \mathcal{I}_{p-k}$.

2.  $\Sigma^-$ is the G-Inverse of $\Sigma$, and $\lambda_1, \ldots, \lambda_k$ are the nonzero eigenvalues of $\Sigma$.

What is the connection to a multinormal with $k$ dimensions? If

$$Y \sim N_k(0, \Lambda_1) \quad\text{and}\quad \Lambda_1 = \mathrm{diag}(\lambda_1, \ldots, \lambda_k), \tag{4.54}$$

then an orthogonal matrix $\mathcal{B}(p \times k)$ with $\mathcal{B}^\top\mathcal{B} = \mathcal{I}_k$ exists such that $X = \mathcal{B}Y + \mu$, where $X$ has a singular pdf of the form (4.53).

Gaussian Copula

In Examples 4.3 and 4.4 we have introduced copulae. Another important copula is the Gaussian or normal copula,

$$C_\rho(u, v) = \int_{-\infty}^{\Phi_1^{-1}(u)} \int_{-\infty}^{\Phi_2^{-1}(v)} f_\rho(x_1, x_2)\,dx_2\,dx_1, \tag{4.55}$$

see Embrechts, McNeil, and Straumann (1999). In (4.55), $f_\rho$ denotes the bivariate normal density function with correlation $\rho$ for $n = 2$. The functions $\Phi_1$ and $\Phi_2$ in (4.55) refer to the corresponding one-dimensional standard normal cdfs of the margins.

In the case of vanishing correlation, $\rho = 0$, the Gaussian copula becomes

$$C_0(u, v) = \int_{-\infty}^{\Phi_1^{-1}(u)} f_{X_1}(x_1)\,dx_1 \int_{-\infty}^{\Phi_2^{-1}(v)} f_{X_2}(x_2)\,dx_2 = uv = \Pi(u, v).$$

Summary

•  The pdf of a $p$-dimensional multinormal $X \sim N_p(\mu, \Sigma)$ is $f(x) = |2\pi\Sigma|^{-1/2} \exp\left\{-\frac{1}{2}(x - \mu)^\top\Sigma^{-1}(x - \mu)\right\}$. The contour curves of a multinormal are ellipsoids with half-lengths proportional to $\sqrt{\lambda_i}$, where $\lambda_i$ denotes the eigenvalues of $\Sigma$ ($i = 1, \ldots, p$).

•  The Mahalanobis transformation transforms $X \sim N_p(\mu, \Sigma)$ to $Y = \Sigma^{-1/2}(X - \mu) \sim N_p(0, \mathcal{I}_p)$. Going in the other direction, one can create a $X \sim N_p(\mu, \Sigma)$ from $Y \sim N_p(0, \mathcal{I}_p)$ via $X = \Sigma^{1/2}Y + \mu$.

•  If the covariance matrix $\Sigma$ is singular (i.e. $\mathrm{rank}(\Sigma) < p$), then it defines a singular normal distribution.

•  The Gaussian copula is given by $C_\rho(u, v) = \int_{-\infty}^{\Phi_1^{-1}(u)} \int_{-\infty}^{\Phi_2^{-1}(v)} f_\rho(x_1, x_2)\,dx_2\,dx_1$.

•  The density of a singular normal distribution is given by $f(x) = \frac{(2\pi)^{-k/2}}{(\lambda_1 \cdots \lambda_k)^{1/2}} \exp\left\{-\frac{1}{2}(x - \mu)^\top\Sigma^-(x - \mu)\right\}$.

4.5  Sampling  Distributions  and  Limit  Theorems 

In multivariate statistics, we observe the values of a multivariate random variable $X$ and obtain a sample $\{x_i\}_{i=1}^{n}$, as described in Chap. 3. Under random sampling, these observations are considered to be realisations of a sequence of i.i.d. random variables $X_1, \ldots, X_n$, where each $X_i$ is a $p$-variate random variable which replicates the parent or population random variable $X$. Some notational confusion is hard to avoid: $X_i$ is not the $i$th component of $X$, but rather the $i$th replicate of the $p$-variate random variable $X$ which provides the $i$th observation $x_i$ of our sample.

For a given random sample $X_1, \ldots, X_n$, the idea of statistical inference is to analyse the properties of the population variable $X$. This is typically done by analysing some characteristic $\theta$ of its distribution, like the mean, covariance matrix, etc. Statistical inference in a multivariate setup is considered in more detail in Chaps. 6 and 7.

Inference can often be performed using some observable function of the sample $X_1, \ldots, X_n$, i.e. a statistic. Examples of such statistics were given in Chap. 3: the sample mean $\bar{x}$, the sample covariance matrix $\mathcal{S}$. To get an idea of the relationship between a statistic and the corresponding population characteristic, one has to derive the sampling distribution of the statistic. The next example gives some insight into the relation of $(\bar{x}, \mathcal{S})$ to $(\mu, \Sigma)$.

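Example 4.15 Consider an iid sample of $n$ random vectors $X_i \in \mathbb{R}^p$ where $\mathrm{E}(X_i) = \mu$ and $\mathrm{Var}(X_i) = \Sigma$. The sample mean $\bar{x}$ and the covariance matrix $\mathcal{S}$ have already been defined in Sect. 3.3. It is easy to prove the following results:

$$\mathrm{E}(\bar{x}) = n^{-1}\sum_{i=1}^{n} \mathrm{E}(X_i) = \mu$$

$$\mathrm{Var}(\bar{x}) = n^{-2}\sum_{i=1}^{n} \mathrm{Var}(X_i) = n^{-1}\Sigma = \mathrm{E}(\bar{x}\bar{x}^\top) - \mu\mu^\top$$

$$\mathrm{E}(\mathcal{S}) = n^{-1}\,\mathrm{E}\left\{\sum_{i=1}^{n}(X_i - \bar{x})(X_i - \bar{x})^\top\right\} = n^{-1}\,\mathrm{E}\left\{\sum_{i=1}^{n} X_i X_i^\top - n\,\bar{x}\bar{x}^\top\right\} = n^{-1}\left\{n(\Sigma + \mu\mu^\top) - n(n^{-1}\Sigma + \mu\mu^\top)\right\} = \frac{n - 1}{n}\Sigma.$$

This shows in particular that $\mathcal{S}$ is a biased estimator of $\Sigma$. By contrast, $\mathcal{S}_u = \frac{n}{n - 1}\mathcal{S}$ is an unbiased estimator of $\Sigma$.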
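The bias factor $(n-1)/n$ from Example 4.15 shows up clearly in a small simulation. The sketch below restricts itself to the univariate case $p = 1$ with normal data, which is an assumption of this illustration rather than of the example; the sample size, number of replications and seed are arbitrary.

```python
import random

rng = random.Random(2)
n, replications, sigma2 = 5, 40_000, 4.0

total_s = 0.0
for _ in range(replications):
    xs = [rng.gauss(0.0, sigma2 ** 0.5) for _ in range(n)]
    xbar = sum(xs) / n
    total_s += sum((x - xbar) ** 2 for x in xs) / n     # S divides by n

mean_s = total_s / replications
print(mean_s, (n - 1) / n * sigma2)        # average of S versus (n - 1)/n * sigma^2 = 3.2
print(mean_s * n / (n - 1), sigma2)        # rescaled version S_u versus sigma^2 = 4
```

With $n = 5$ the downward bias of $\mathcal{S}$ is 20 percent, so it is visible even with a moderate number of replications.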

Statistical inference often requires more than just the mean and/or the variance
of a statistic. We need the sampling distribution of the statistic to derive confidence
intervals or to define rejection regions in hypothesis testing for a given significance
level. Theorem 4.9 gives the distribution of the sample mean for a multinormal
population.

Theorem 4.9 Let $X_1, \ldots, X_n$ be i.i.d. with $X_i \sim N_p(\mu, \Sigma)$. Then
$\bar{x} \sim N_p(\mu, n^{-1}\Sigma)$.

Proof $\bar{x} = n^{-1} \sum_{i=1}^{n} X_i$ is a linear combination of independent normal variables,
so it has a normal distribution (see Chap. 5). The mean and the covariance matrix
were given in the preceding example. □

With  multivariate  statistics,  the  sampling  distributions  of  the  statistics  are  often 
more  difficult  to  derive  than  in  the  preceding  Theorem.  In  addition  they  might 
be  so  complicated  that  approximations  have  to  be  used.  These  approximations 
are  provided  by  limit  theorems.  Since  they  are  based  on  asymptotic  limits,  the 
approximations  are  only  valid  when  the  sample  size  is  large  enough.  In  spite  of  this 
restriction,  they  make  complicated  situations  rather  simple.  The  following  central 
limit  theorem  shows  that  even  if  the  parent  distribution  is  not  normal,  when  the 
sample  size  n  is  large,  the  sample  mean  x  has  an  approximate  normal  distribution. 

Theorem 4.10 (Central Limit Theorem (CLT)) Let $X_1, X_2, \ldots, X_n$ be i.i.d. with
$X_i \sim (\mu, \Sigma)$. Then the distribution of $\sqrt{n}(\bar{x} - \mu)$ is asymptotically $N_p(0, \Sigma)$, i.e.

$$\sqrt{n}(\bar{x} - \mu) \xrightarrow{\mathcal{L}} N_p(0, \Sigma) \quad \text{as } n \to \infty.$$

The symbol "$\xrightarrow{\mathcal{L}}$" denotes convergence in distribution, which means that the
distribution function of the random vector $\sqrt{n}(\bar{x} - \mu)$ converges to the distribution
function of $N_p(0, \Sigma)$.




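Example 4.16 Assume that $X_1, \ldots, X_n$ are i.i.d. and that they have Bernoulli
distributions where $p = \frac{1}{2}$ (this means that $P(X_i = 1) = P(X_i = 0) = \frac{1}{2}$).
Then $\mu = p = \frac{1}{2}$ and $\Sigma = p(1 - p) = \frac{1}{4}$. Hence,

$$\sqrt{n}\left(\bar{x} - \frac{1}{2}\right) \xrightarrow{\mathcal{L}} N_1\left(0, \frac{1}{4}\right) \quad \text{as } n \to \infty.$$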


The  results  are  shown  in  Fig.  4.4  for  varying  sample  sizes. 
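The quality of the normal approximation in Example 4.16 can also be quantified without simulation. The sketch below is an illustration under stated assumptions, not the quantlet MVAcltbern: it uses the exact binomial distribution of $n\bar{x}$ and reports the largest distance between the exact cdf of the standardised mean and the standard normal cdf. The helper names `normal_cdf` and `max_cdf_gap` are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def max_cdf_gap(n, p=0.5):
    """Largest gap between the exact cdf of sqrt(n)(xbar - p)/sqrt(p(1-p)) and Phi.

    The exact cdf is a step function, so the supremum is attained at a jump
    point, either just before or just after the jump.
    """
    sd = math.sqrt(p * (1 - p))
    cum = 0.0
    gap = 0.0
    for k in range(n + 1):
        z = math.sqrt(n) * (k / n - p) / sd
        phi = normal_cdf(z)
        gap = max(gap, abs(cum - phi))            # left limit at the jump
        cum += math.comb(n, k) * p**k * (1 - p) ** (n - k)
        gap = max(gap, abs(cum - phi))            # value at the jump
    return gap


for n in (5, 35):
    print(n, max_cdf_gap(n))
```

The gap shrinks as $n$ grows from 5 to 35, in line with the two panels of Fig. 4.4.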

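Example 4.17 Now consider a two-dimensional random sample $X_1, \ldots, X_n$ that is
i.i.d. and created from two independent Bernoulli distributions with $p = 0.5$. The
joint distribution is given by $P(X_i = (0,0)^\top) = P(X_i = (0,1)^\top) = P(X_i = (1,0)^\top)
= P(X_i = (1,1)^\top) = \frac{1}{4}$. Here we have

$$\sqrt{n}\left\{\bar{x} - \begin{pmatrix} \frac{1}{2} \\ \frac{1}{2} \end{pmatrix}\right\}
\xrightarrow{\mathcal{L}} N_2\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix},
\begin{pmatrix} \frac{1}{4} & 0 \\ 0 & \frac{1}{4} \end{pmatrix} \right) \quad \text{as } n \to \infty.$$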


Figure  4.5  displays  the  estimated  two-dimensional  density  for  different  sample 
sizes. 

The asymptotic normal distribution is often used to construct confidence intervals
for the unknown parameters. A confidence interval at the level $1 - \alpha$, $\alpha \in (0, 1)$, is
an interval that covers the true parameter with probability $1 - \alpha$:

$$P(\theta \in [\hat{\theta}_l, \hat{\theta}_u]) = 1 - \alpha,$$

where $\theta$ denotes the (unknown) parameter and $\hat{\theta}_l$ and $\hat{\theta}_u$ are the lower and upper
confidence bounds, respectively.

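Example 4.18 Consider the i.i.d. random variables $X_1, \ldots, X_n$ with $X_i \sim (\mu, \sigma^2)$
and $\sigma^2$ known. Since we have $\sqrt{n}(\bar{x} - \mu) \xrightarrow{\mathcal{L}} N(0, \sigma^2)$ from the CLT, it follows that

$$P\left( -u_{1-\alpha/2} \le \sqrt{n}\,\frac{(\bar{x} - \mu)}{\sigma} \le u_{1-\alpha/2} \right) \longrightarrow 1 - \alpha, \quad \text{as } n \to \infty,$$

where $u_{1-\alpha/2}$ denotes the $(1 - \alpha/2)$-quantile of the standard normal distribution.
Hence the interval

$$\left[ \bar{x} - \frac{\sigma}{\sqrt{n}}\, u_{1-\alpha/2},\ \bar{x} + \frac{\sigma}{\sqrt{n}}\, u_{1-\alpha/2} \right]$$

is an approximate $(1 - \alpha)$-confidence interval for $\mu$.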

But what can we do if we do not know the variance $\sigma^2$? The following corollary
gives the answer.

Fig. 4.4 The CLT for Bernoulli distributed random variables. Sample size n = 5
(up) and n = 35 (down) Q MVAcltbern
[Panels: "Asymptotic Distribution, n=5" and "Asymptotic Distribution, n=35", each based on 1000 random samples]

Corollary 4.1 If $\hat{\Sigma}$ is a consistent estimate for $\Sigma$, then the CLT still holds, namely

$$\sqrt{n}\,\hat{\Sigma}^{-1/2}(\bar{x} - \mu) \xrightarrow{\mathcal{L}} N_p(0, \mathcal{I}) \quad \text{as } n \to \infty.$$

Fig. 4.5 The CLT in the two-dimensional case. Sample size n = 5 (left) and n = 85 (right) Q
MVAcltbern2

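Example 4.19 Consider the i.i.d. random variables $X_1, \ldots, X_n$ with $X_i \sim (\mu, \sigma^2)$,
and now with an unknown variance $\sigma^2$. From Corollary 4.1 using $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$
we obtain

$$\sqrt{n}\left( \frac{\bar{x} - \mu}{\hat{\sigma}} \right) \xrightarrow{\mathcal{L}} N(0, 1) \quad \text{as } n \to \infty.$$

Hence we can construct an approximate $(1 - \alpha)$-confidence interval for $\mu$ using the
variance estimate $\hat{\sigma}^2$:

$$C_{1-\alpha} = \left[ \bar{x} - \frac{\hat{\sigma}}{\sqrt{n}}\, u_{1-\alpha/2},\ \bar{x} + \frac{\hat{\sigma}}{\sqrt{n}}\, u_{1-\alpha/2} \right].$$

Note that by the CLT

$$P(\mu \in C_{1-\alpha}) \longrightarrow 1 - \alpha \quad \text{as } n \to \infty.$$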
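The interval $C_{1-\alpha}$ of Example 4.19 is straightforward to compute. The following is a minimal sketch, assuming Python's standard `statistics.NormalDist` for the quantile $u_{1-\alpha/2}$; the function name `clt_confidence_interval` and the data values are hypothetical and serve only to show the arithmetic.

```python
import math
from statistics import NormalDist


def clt_confidence_interval(xs, alpha=0.05):
    """Approximate (1 - alpha) interval for the mean, using sigma_hat with divisor n."""
    n = len(xs)
    xbar = sum(xs) / n
    sigma_hat = math.sqrt(sum((x - xbar) ** 2 for x in xs) / n)
    u = NormalDist().inv_cdf(1 - alpha / 2)       # (1 - alpha/2)-quantile of N(0, 1)
    half_width = u * sigma_hat / math.sqrt(n)
    return xbar - half_width, xbar + half_width


# Hypothetical observations, used only for illustration.
data = [2.1, 1.9, 2.4, 2.0, 1.7, 2.3, 2.2, 1.8]
lower, upper = clt_confidence_interval(data)
print(lower, upper)
```

With only eight observations the asymptotic justification is weak (see Remark 4.1); the example shows the computation, not a recommended sample size.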


Remark 4.1 One may wonder how large $n$ should be in practice to provide
reasonable approximations. There is no definite answer to this question: it mainly
depends on the problem at hand (the shape of the distribution of the $X_i$ and the
dimension of $X_i$). If the $X_i$ are normally distributed, the normality of $\bar{x}$ is achieved
from $n = 1$. In most situations, however, the approximation is valid in
one-dimensional problems for $n$ larger than, say, 50.

Transformation of Statistics

Often in practical problems, one is interested in a function of parameters for
which one has an asymptotically normal statistic. Suppose for instance that we are
interested in a cost function depending on the mean $\mu$ of the process: $f(\mu) = \mu^\top \mathcal{A} \mu$
where $\mathcal{A} > 0$ is given. To estimate $\mu$ we use the asymptotically normal
statistic $\bar{x}$. The question is: how does $f(\bar{x})$ behave? More generally, what happens
to a statistic $t$ that is asymptotically normal when we transform it by a function
$f(t)$? The answer is given by the following theorem.

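Theorem 4.11 If $\sqrt{n}(t - \mu) \xrightarrow{\mathcal{L}} N_p(0, \Sigma)$ and if $f = (f_1, \ldots, f_q)^\top : \mathbb{R}^p \to \mathbb{R}^q$
are real valued functions which are differentiable at $\mu \in \mathbb{R}^p$, then $f(t)$ is
asymptotically normal with mean $f(\mu)$ and covariance $\mathcal{D}^\top \Sigma \mathcal{D}$, i.e.

$$\sqrt{n}\{f(t) - f(\mu)\} \xrightarrow{\mathcal{L}} N_q(0, \mathcal{D}^\top \Sigma \mathcal{D}) \quad \text{for } n \to \infty, \tag{4.56}$$

where

$$\mathcal{D} = \left( \frac{\partial f_j}{\partial t_i} \right)(t) \bigg|_{t=\mu}$$

is the $(p \times q)$ matrix of all partial derivatives.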

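Example 4.20 We are interested in seeing how $f(\bar{x}) = \bar{x}^\top \mathcal{A} \bar{x}$ behaves
asymptotically with respect to the quadratic cost function of $\mu$, $f(\mu) = \mu^\top \mathcal{A} \mu$, where
$\mathcal{A} > 0$.

$$\mathcal{D} = \frac{\partial f(\bar{x})}{\partial \bar{x}} \bigg|_{\bar{x}=\mu} = 2\mathcal{A}\mu.$$

By Theorem 4.11 we have

$$\sqrt{n}(\bar{x}^\top \mathcal{A} \bar{x} - \mu^\top \mathcal{A} \mu) \xrightarrow{\mathcal{L}} N_1(0, 4\mu^\top \mathcal{A} \Sigma \mathcal{A} \mu).$$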


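Example 4.21 Suppose

$$X_i \sim (\mu, \Sigma); \quad \mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad
\Sigma = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}, \quad p = 2.$$

We have by the CLT (Theorem 4.10) for $n \to \infty$ that

$$\sqrt{n}(\bar{x} - \mu) \xrightarrow{\mathcal{L}} N(0, \Sigma).$$

Suppose that we would like to compute the distribution of
$\begin{pmatrix} \bar{x}_1^2 - \bar{x}_2 \\ \bar{x}_1 + 3\bar{x}_2 \end{pmatrix}$. According
to Theorem 4.11 we have to consider $f = (f_1, f_2)^\top$ with

$$f_1(x_1, x_2) = x_1^2 - x_2, \quad f_2(x_1, x_2) = x_1 + 3x_2, \quad q = 2.$$

Given this, $f(\mu) = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$ and

$$\mathcal{D} = (d_{ij}), \quad d_{ij} = \left( \frac{\partial f_j}{\partial x_i} \right) \bigg|_{x=\mu}
= \begin{pmatrix} 2x_1 & 1 \\ -1 & 3 \end{pmatrix} \bigg|_{x=0}.$$

Thus

$$\mathcal{D} = \begin{pmatrix} 0 & 1 \\ -1 & 3 \end{pmatrix}.$$

The covariance is

$$\mathcal{D}^\top \Sigma \mathcal{D}
= \begin{pmatrix} 0 & -1 \\ 1 & 3 \end{pmatrix}
\begin{pmatrix} 1 & \frac{1}{2} \\ \frac{1}{2} & 1 \end{pmatrix}
\begin{pmatrix} 0 & 1 \\ -1 & 3 \end{pmatrix}
= \begin{pmatrix} 1 & -\frac{7}{2} \\ -\frac{7}{2} & 13 \end{pmatrix},$$

which yields

$$\sqrt{n} \begin{pmatrix} \bar{x}_1^2 - \bar{x}_2 \\ \bar{x}_1 + 3\bar{x}_2 \end{pmatrix}
\xrightarrow{\mathcal{L}} N_2\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix},
\begin{pmatrix} 1 & -\frac{7}{2} \\ -\frac{7}{2} & 13 \end{pmatrix} \right).$$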

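Example 4.22 Let us continue the previous example by adding one more
component to the function $f$. Since $q = 3 > p = 2$, we might expect a singular normal
distribution. Consider $f = (f_1, f_2, f_3)^\top$ with

$$f_1(x_1, x_2) = x_1^2 - x_2, \quad f_2(x_1, x_2) = x_1 + 3x_2, \quad f_3 = x_1^3, \quad q = 3.$$

From this we have that

$$\mathcal{D} = \begin{pmatrix} 0 & 1 & 0 \\ -1 & 3 & 0 \end{pmatrix}
\quad \text{and thus} \quad
\mathcal{D}^\top \Sigma \mathcal{D} = \begin{pmatrix} 1 & -\frac{7}{2} & 0 \\ -\frac{7}{2} & 13 & 0 \\ 0 & 0 & 0 \end{pmatrix}.$$

The limit is in fact a singular normal distribution!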
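The covariance $\mathcal{D}^\top \Sigma \mathcal{D}$ of Theorem 4.11 can be verified numerically for the two preceding examples. The sketch below is an illustration only. It assumes the readings $f_1(x_1,x_2) = x_1^2 - x_2$, $f_2(x_1,x_2) = x_1 + 3x_2$ and $f_3 = x_1^3$ (the exponents are hard to read in the source), $\mu = (0,0)^\top$, and $\Sigma$ with unit variances and covariance 0.5. The helpers `matmul` and `transpose` are hypothetical plain-Python utilities.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


sigma = [[1.0, 0.5],
         [0.5, 1.0]]

# Column j holds the gradient of f_j evaluated at mu = (0, 0):
# grad f1 = (2*x1, -1) -> (0, -1), grad f2 = (1, 3), grad f3 = (3*x1**2, 0) -> (0, 0).
D2 = [[0.0, 1.0],
      [-1.0, 3.0]]
D3 = [[0.0, 1.0, 0.0],
      [-1.0, 3.0, 0.0]]

cov2 = matmul(matmul(transpose(D2), sigma), D2)   # Example 4.21
cov3 = matmul(matmul(transpose(D3), sigma), D3)   # Example 4.22

print(cov2)
print(cov3)
```

The third row and column of `cov3` vanish because the gradient of $f_3$ is zero at the origin, which is why the limiting normal distribution is singular.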
Summary

↪ If $X_1, \ldots, X_n$ are i.i.d. random vectors with $X_i \sim N_p(\mu, \Sigma)$, then
$\bar{x} \sim N_p(\mu, \frac{1}{n}\Sigma)$.

↪ If $X_1, \ldots, X_n$ are i.i.d. random vectors with $X_i \sim (\mu, \Sigma)$, then the
distribution of $\sqrt{n}(\bar{x} - \mu)$ is asymptotically $N(0, \Sigma)$ (Central Limit
Theorem).

↪ If $X_1, \ldots, X_n$ are i.i.d. random variables with $X_i \sim (\mu, \sigma^2)$, then
an asymptotic confidence interval can be constructed by the CLT:
$\bar{x} \pm \frac{\hat{\sigma}}{\sqrt{n}}\, u_{1-\alpha/2}$.

↪ If $t$ is a statistic that is asymptotically normal, i.e. $\sqrt{n}(t - \mu) \xrightarrow{\mathcal{L}} N_p(0, \Sigma)$,
then this holds also for a function $f(t)$, i.e. $\sqrt{n}\{f(t) - f(\mu)\}$ is asymptotically normal.


4.6  Heavy-Tailed  Distributions 

Heavy-tailed distributions were first introduced by the Italian-born Swiss economist
Pareto and extensively studied by Paul Lévy. Although in the beginning these
distributions were mainly studied theoretically, nowadays they have found many
applications in areas as diverse as finance, medicine, seismology and structural
engineering. More concretely, they have been used to model returns of assets in
financial markets, stream flow in hydrology, precipitation and hurricane damage
in meteorology, earthquake prediction in seismology, pollution, material strength,
teletraffic and many others.

A distribution is called heavy-tailed if it has higher probability density in its
tail area compared with a normal distribution with the same mean $\mu$ and variance $\sigma^2$.
Figure 4.6 demonstrates the differences of the pdf curves of a standard Gaussian
distribution and a Cauchy distribution with location parameter $\mu = 0$ and scale
parameter $\sigma = 1$. The graphic shows that the probability density of the Cauchy
distribution is much higher than that of the Gaussian in the tail part, while in the
area around the centre, the probability density of the Cauchy distribution is much
lower.
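The comparison described for Fig. 4.6 can be reproduced numerically. The following sketch is an illustration, not the quantlet MVAgausscauchy: it evaluates the standard Gaussian pdf and the standard Cauchy pdf (location 0, scale 1) at a few points. The function names are hypothetical.

```python
import math


def gauss_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def cauchy_pdf(x, m=0.0, s=1.0):
    """Cauchy density with location m and scale s."""
    return 1.0 / (s * math.pi * (1.0 + ((x - m) / s) ** 2))


for x in (0.0, 1.0, 2.0, 3.0, 5.0):
    print(x, gauss_pdf(x), cauchy_pdf(x))
```

Near the centre the Gaussian density is the larger of the two, while a few units away from the centre the Cauchy density dominates, and the ratio grows rapidly further out.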

In terms of kurtosis, a heavy-tailed distribution has kurtosis greater than 3 (see
Chap. 4, formula (4.40)), which is called leptokurtic, in contrast to a mesokurtic
distribution (kurtosis = 3) and a platykurtic distribution (kurtosis < 3). Since univariate
heavy-tailed distributions serve as the basis for their multivariate counterparts and their
density properties have proved useful even in multivariate cases, we start by
introducing some univariate heavy-tailed distributions. Then we move on
to analyse their multivariate counterparts and their tail behaviour.

Fig. 4.6 Comparison of the pdf of a standard Gaussian (blue) and a Cauchy distribution (red) with
location parameter 0 and scale parameter 1 Q MVAgausscauchy

Fig. 4.7 pdf (left) and cdf (right) of GH ($\lambda = 0.5$, black), HYP (red), and NIG (blue) with
$\alpha = 1$, $\beta = 0$, $\delta = 1$, $\mu = 0$ Q MVAghdis

Generalised  Hyperbolic  Distribution 

The  generalised  hyperbolic  distribution  was  introduced  by  Barndorff-Nielsen  and  at 
first  applied  to  model  grain  size  distributions  of  wind  blown  sands.  Today  one  of 
its  most  important  uses  is  in  stock  price  modelling  and  market  risk  measurement. 
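The name of the distribution is derived from the fact that its log-density forms a
hyperbola, while the log-density of the normal distribution is a parabola (Fig. 4.7).

The density of a one-dimensional generalised hyperbolic (GH) distribution for
$x \in \mathbb{R}$ is

$$f_{GH}(x; \lambda, \alpha, \beta, \delta, \mu)
= \frac{\left(\sqrt{\alpha^2 - \beta^2}/\delta\right)^{\lambda}}{\sqrt{2\pi}\, K_\lambda\left(\delta\sqrt{\alpha^2 - \beta^2}\right)}
\cdot \frac{K_{\lambda - 1/2}\left\{\alpha\sqrt{\delta^2 + (x - \mu)^2}\right\}}{\left\{\sqrt{\delta^2 + (x - \mu)^2}/\alpha\right\}^{1/2 - \lambda}}\,
e^{\beta(x - \mu)} \tag{4.57}$$

where $K_\lambda$ is a modified Bessel function of the third kind with index $\lambda$

$$K_\lambda(x) = \frac{1}{2} \int_0^\infty y^{\lambda - 1} e^{-\frac{x}{2}(y + y^{-1})}\, dy \tag{4.58}$$

The domain of variation of the parameters is $\mu \in \mathbb{R}$ and

$$\delta \ge 0, \quad |\beta| < \alpha, \quad \text{if } \lambda > 0$$
$$\delta > 0, \quad |\beta| < \alpha, \quad \text{if } \lambda = 0$$
$$\delta > 0, \quad |\beta| \le \alpha, \quad \text{if } \lambda < 0$$

The generalised hyperbolic distribution has the following mean and variance

$$E[X] = \mu + \frac{\delta\beta}{\sqrt{\alpha^2 - \beta^2}}\,
\frac{K_{\lambda+1}(\delta\sqrt{\alpha^2 - \beta^2})}{K_\lambda(\delta\sqrt{\alpha^2 - \beta^2})} \tag{4.59}$$

$$\mathrm{Var}[X] = \delta^2 \left[ \frac{K_{\lambda+1}(\delta\sqrt{\alpha^2 - \beta^2})}{\delta\sqrt{\alpha^2 - \beta^2}\, K_\lambda(\delta\sqrt{\alpha^2 - \beta^2})}
+ \frac{\beta^2}{\alpha^2 - \beta^2} \left\{ \frac{K_{\lambda+2}(\delta\sqrt{\alpha^2 - \beta^2})}{K_\lambda(\delta\sqrt{\alpha^2 - \beta^2})}
- \left( \frac{K_{\lambda+1}(\delta\sqrt{\alpha^2 - \beta^2})}{K_\lambda(\delta\sqrt{\alpha^2 - \beta^2})} \right)^2 \right\} \right] \tag{4.60}$$

where $\mu$ and $\delta$ play important roles in the density's location and scale respectively.
With specific values of $\lambda$, we obtain different sub-classes of GH such as the hyperbolic
(HYP) or normal-inverse Gaussian (NIG) distribution.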

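For $\lambda = 1$ we obtain the hyperbolic distributions (HYP)

$$f_{HYP}(x; \alpha, \beta, \delta, \mu) = \frac{\sqrt{\alpha^2 - \beta^2}}{2\alpha\delta K_1(\delta\sqrt{\alpha^2 - \beta^2})}\,
e^{\{-\alpha\sqrt{\delta^2 + (x - \mu)^2} + \beta(x - \mu)\}} \tag{4.61}$$

where $x, \mu \in \mathbb{R}$, $\delta \ge 0$ and $|\beta| < \alpha$. For $\lambda = -1/2$ we obtain the NIG distribution

$$f_{NIG}(x; \alpha, \beta, \delta, \mu) = \frac{\alpha\delta}{\pi}\,
\frac{K_1\left\{\alpha\sqrt{\delta^2 + (x - \mu)^2}\right\}}{\sqrt{\delta^2 + (x - \mu)^2}}\,
e^{\{\delta\sqrt{\alpha^2 - \beta^2} + \beta(x - \mu)\}} \tag{4.62}$$

Student's t-Distribution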

The t-distribution was first analysed by Gosset (1908), who published it under the
pseudonym "Student" by request of his employer. Let $X$ be a normally distributed
random variable with mean $\mu$ and variance $\sigma^2$, and $Y$ be the random variable such
that $Y^2/\sigma^2$ has a chi-square distribution with $n$ degrees of freedom. Assume that $X$
and $Y$ are independent, then

$$t \stackrel{\mathrm{def}}{=} \frac{(X - \mu)\sqrt{n}}{Y} \tag{4.63}$$

is distributed as Student's t with $n$ degrees of freedom. The t-distribution has the
following density function

$$f_t(x; n) = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\left(\frac{n}{2}\right)}
\left( 1 + \frac{x^2}{n} \right)^{-\frac{n+1}{2}} \tag{4.64}$$

where $n$ is the number of degrees of freedom, $-\infty < x < \infty$, and $\Gamma$ is the gamma
function:

$$\Gamma(\alpha) = \int_0^\infty x^{\alpha - 1} e^{-x}\, dx. \tag{4.65}$$

The mean, variance, skewness and kurtosis of Student's t-distribution ($n > 4$) are:

$$\mu = 0$$
$$\sigma^2 = \frac{n}{n - 2}$$
$$\text{Skewness} = 0$$
$$\text{Kurtosis} = 3 + \frac{6}{n - 4}.$$

The t-distribution is symmetric around 0, which is consistent with the fact that its
mean is 0 and skewness is also 0 (Fig. 4.8).

Student's t-distribution approaches the normal distribution as $n$ increases, since

$$\lim_{n \to \infty} f_t(x; n) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}. \tag{4.66}$$
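The density (4.64) and the limit (4.66) can be explored with a few lines of code. The sketch below is an illustration, not the quantlet MVAtdis: it evaluates the t density through `math.lgamma` (to avoid overflow of the gamma function for large $n$) and compares it with the standard normal density. The function names `t_pdf`, `normal_pdf` and `t_kurtosis` are hypothetical.

```python
import math


def t_pdf(x, n):
    """Density (4.64) of Student's t with n degrees of freedom, computed on the log scale."""
    log_const = (math.lgamma((n + 1) / 2) - math.lgamma(n / 2)
                 - 0.5 * math.log(n * math.pi))
    return math.exp(log_const - (n + 1) / 2 * math.log1p(x * x / n))


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def t_kurtosis(n):
    """Kurtosis 3 + 6/(n - 4), defined only for n > 4."""
    if n <= 4:
        raise ValueError("kurtosis requires n > 4")
    return 3 + 6 / (n - 4)


for n in (1, 3, 30, 1000):
    print(n, t_pdf(2.0, n), normal_pdf(2.0))
print(t_kurtosis(10))
```

For $n = 1$ the density coincides with the standard Cauchy density, and as $n$ grows the values at a fixed point move towards the normal density while the kurtosis falls towards 3.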

In practice the t-distribution is widely used, but its flexibility of modelling is
restricted because of the integer-valued tail index.

In the tail area of the t-distribution, the density is proportional to $|x|^{-(n+1)}$. In Fig. 4.13
we compare the tail behaviour of t-distributions with different degrees of freedom.
With higher degrees of freedom, the t-distribution decays faster.

Fig. 4.8 pdf (left) and cdf (right) of the t-distribution with different degrees of freedom (t3 stands for
the t-distribution with 3 degrees of freedom) Q MVAtdis


Laplace  Distribution 

The univariate Laplace distribution with mean zero was introduced by Laplace
(1774). The Laplace distribution can be defined as the distribution of differences
between two independent variates with identical exponential distributions.
Therefore it is also called the double exponential distribution (Fig. 4.9).

Fig. 4.9 pdf (left) and cdf (right) of the Laplace distribution with zero mean and different scale
parameters (L1 stands for the Laplace distribution with $\theta = 1$) Q MVAlaplacedis

The Laplace distribution with mean $\mu$ and scale parameter $\theta$ has the pdf

$$f_{Laplace}(x; \mu, \theta) = \frac{1}{2\theta}\, e^{-\frac{|x - \mu|}{\theta}} \tag{4.67}$$

and the cdf

$$F_{Laplace}(x; \mu, \theta) = \frac{1}{2} \left\{ 1 + \mathrm{sign}(x - \mu)\left( 1 - e^{-\frac{|x - \mu|}{\theta}} \right) \right\} \tag{4.68}$$

where sign is the sign function. The mean, variance, skewness and kurtosis of the
Laplace distribution are

$$\mu = \mu$$
$$\sigma^2 = 2\theta^2$$
$$\text{Skewness} = 0$$
$$\text{Kurtosis} = 6$$

With mean 0 and $\theta = 1$, we obtain the standard Laplace distribution

$$f(x) = \frac{e^{-|x|}}{2} \tag{4.69}$$

$$F(x) = \begin{cases} \dfrac{e^{x}}{2} & \text{for } x < 0 \\[1ex] 1 - \dfrac{e^{-x}}{2} & \text{for } x \ge 0 \end{cases} \tag{4.70}$$
Cauchy  Distribution 

The  Cauchy  distribution  is  motivated  by  the  following  example. 

Example 4.23 A gangster has just robbed a bank. As he runs to a point $s$ metres
away from the wall of the bank, a policeman reaches the crime scene behind the
wall of the bank. The robber turns back and starts to shoot but he is such a poor
shooter that the angle of his fire (marked in Fig. 4.10 as $\alpha$) is uniformly distributed.
The bullets hit the wall at distance $x$ (from the centre). Obviously the distribution
of $x$, the random variable where the bullet hits the wall, is of vital knowledge to the
policeman in order to identify the location of the gangster. (Should the policeman
calculate the mean or the median of the observed bullet hits $\{x_i\}_{i=1}^{n}$ in order to
identify the location of the robber?)

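Since $\alpha$ is uniformly distributed:

$$f(\alpha) = \frac{1}{\pi}\, I(\alpha \in [-\pi/2, \pi/2])$$

Fig. 4.10 Introduction to Cauchy distribution: robber vs. policeman

and

$$\tan\alpha = \frac{x}{s}, \qquad \alpha = \arctan\left(\frac{x}{s}\right), \qquad
d\alpha = \frac{1}{s}\, \frac{1}{1 + \left(\frac{x}{s}\right)^2}\, dx.$$

For a small interval $d\alpha$, the probability is given by

$$f(\alpha)\, d\alpha = \frac{1}{\pi}\, d\alpha = \frac{1}{s\pi}\, \frac{1}{1 + \left(\frac{x}{s}\right)^2}\, dx$$

with

$$\int_{-\infty}^{\infty} \frac{1}{s\pi}\, \frac{1}{1 + \left(\frac{x}{s}\right)^2}\, dx
= \frac{1}{\pi} \left\{ \arctan\left(\frac{x}{s}\right) \right\} \bigg|_{-\infty}^{\infty}
= \frac{1}{\pi} \left\{ \frac{\pi}{2} - \left( -\frac{\pi}{2} \right) \right\} = 1.$$

So the pdf of $x$ can be written as:

$$f(x) = \frac{1}{s\pi}\, \frac{1}{1 + \left(\frac{x}{s}\right)^2}$$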

The general formula for the pdf and cdf of the Cauchy distribution is

$$f_{Cauchy}(x; m, s) = \frac{1}{s\pi}\, \frac{1}{1 + \left(\frac{x - m}{s}\right)^2} \tag{4.71}$$

$$F_{Cauchy}(x; m, s) = \frac{1}{2} + \frac{1}{\pi} \arctan\left( \frac{x - m}{s} \right) \tag{4.72}$$

where $m$ and $s$ are the location and scale parameters respectively. The case in the above
example where $m = 0$ and $s = 1$ is called the standard Cauchy distribution with
pdf and cdf as follows,

$$f_{Cauchy}(x) = \frac{1}{\pi(1 + x^2)} \tag{4.73}$$

$$F_{Cauchy}(x) = \frac{1}{2} + \frac{\arctan(x)}{\pi} \tag{4.74}$$

The mean, variance, skewness and kurtosis of the Cauchy distribution are all undefined,
since its moment generating function diverges. But it has a mode and a median, both
equal to the location parameter $m$ (Fig. 4.11).
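The policeman's question in Example 4.23 (mean or median?) can be explored by simulation. The sketch below is an illustration under stated assumptions: the robber stands at a hypothetical location $m = 2$ with $s = 1$, the firing angle is drawn uniformly on $[-\pi/2, \pi/2)$, and each hit is $m + s\tan\alpha$. The function name `simulate_hits` and the seed are arbitrary choices.

```python
import math
import random
import statistics


def simulate_hits(n, m=0.0, s=1.0, seed=1):
    """Simulate n bullet hits x = m + s * tan(alpha), alpha uniform on [-pi/2, pi/2)."""
    rng = random.Random(seed)
    return [m + s * math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]


hits = simulate_hits(20001, m=2.0, s=1.0)
print("median:", statistics.median(hits))
print("mean:  ", statistics.fmean(hits))
```

Since the Cauchy distribution has no mean, the sample mean does not settle down as $n$ grows and can be far from $m$ for some seeds, whereas the sample median is a consistent estimate of the location parameter. This suggests the median as the policeman's choice.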


Mixture  Model 

Mixture modelling concerns modelling a statistical distribution by a mixture (or
weighted sum) of different distributions. For many choices of component density
functions, the mixture model can approximate any continuous density to arbitrary
accuracy, provided that the number of component density functions is sufficiently
large and the parameters of the model are chosen correctly. The pdf of a mixture
distribution consists of $L$ distributions and can be written as:

$$f(x) = \sum_{l=1}^{L} w_l\, p_l(x) \tag{4.75}$$

under the constraints:

$$0 \le w_l \le 1$$
$$\sum_{l=1}^{L} w_l = 1$$
$$\int p_l(x)\, dx = 1$$

where $p_l(x)$ is the pdf of the $l$th component density and $w_l$ is a weight. The mean,
variance, skewness and kurtosis of a mixture are

$$\mu = \sum_{l=1}^{L} w_l \mu_l \tag{4.76}$$

$$\sigma^2 = \sum_{l=1}^{L} w_l \{\sigma_l^2 + (\mu_l - \mu)^2\} \tag{4.77}$$

$$\text{Skewness} = \sum_{l=1}^{L} w_l \left\{ \left( \frac{\sigma_l}{\sigma} \right)^3 SK_l
+ \frac{3\sigma_l^2(\mu_l - \mu)}{\sigma^3} + \left( \frac{\mu_l - \mu}{\sigma} \right)^3 \right\} \tag{4.78}$$

$$\text{Kurtosis} = \sum_{l=1}^{L} w_l \left\{ \left( \frac{\sigma_l}{\sigma} \right)^4 K_l
+ \frac{6(\mu_l - \mu)^2\sigma_l^2}{\sigma^4} + \frac{4(\mu_l - \mu)\sigma_l^3}{\sigma^4}\, SK_l
+ \left( \frac{\mu_l - \mu}{\sigma} \right)^4 \right\} \tag{4.79}$$

where $\mu_l$, $\sigma_l$, $SK_l$ and $K_l$ are respectively the mean, standard deviation, skewness and kurtosis of
the $l$th distribution.

Fig. 4.11 pdf (left) and cdf (right) of the Cauchy distribution with $m = 0$ and different scale
parameters (C1 stands for the Cauchy distribution with $s = 1$) Q MVAcauchy
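Mixture models are ubiquitous in virtually every facet of statistical analysis,
machine learning and data mining. For data sets comprising continuous
variables, the most common approach involves mixture distributions having Gaussian
components.

The pdf for a Gaussian mixture is:

$$f_{GM}(x) = \sum_{l=1}^{L} \frac{w_l}{\sqrt{2\pi}\,\sigma_l}\, e^{-\frac{(x - \mu_l)^2}{2\sigma_l^2}} \tag{4.80}$$

For a Gaussian mixture consisting of Gaussian distributions with mean 0, this can
be simplified to:

$$f_{GM}(x) = \sum_{l=1}^{L} \frac{w_l}{\sqrt{2\pi}\,\sigma_l}\, e^{-\frac{x^2}{2\sigma_l^2}} \tag{4.81}$$

with variance, skewness and kurtosis

$$\sigma^2 = \sum_{l=1}^{L} w_l \sigma_l^2 \tag{4.82}$$

$$\text{Skewness} = 0 \tag{4.83}$$

$$\text{Kurtosis} = \sum_{l=1}^{L} w_l \left( \frac{\sigma_l}{\sigma} \right)^4 3 \tag{4.84}$$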

Table 4.2 Basic statistics of t, Laplace and Cauchy distribution

            t                    Laplace        Cauchy
Mean        $0$                  $\mu$          Not defined
Variance    $n/(n-2)$            $2\theta^2$    Not defined
Skewness    $0$                  $0$            Not defined
Kurtosis    $3 + 6/(n-4)$        $6$            Not defined

Fig. 4.12 pdf (left) and cdf (right) of a Gaussian mixture (Example 4.24) Q MVAmixture

Example 4.24 Consider a Gaussian mixture which is 80 % $N(0,1)$ and 20 % $N(0,9)$.
The pdfs of $N(0,1)$ and $N(0,9)$ are (Fig. 4.12):

$$f_{N(0,1)}(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}$$

$$f_{N(0,9)}(x) = \frac{1}{3\sqrt{2\pi}}\, e^{-\frac{x^2}{18}}$$

so the pdf of the Gaussian mixture is

$$f_{GM}(x) = \frac{1}{5\sqrt{2\pi}} \left( 4e^{-\frac{x^2}{2}} + \frac{1}{3}\, e^{-\frac{x^2}{18}} \right).$$

Notice that the Gaussian mixture is not a Gaussian distribution:

$$\mu = 0$$
$$\sigma^2 = 0.8 \times 1 + 0.2 \times 9 = 2.6$$
$$\text{Skewness} = 0$$
$$\text{Kurtosis} = 0.8 \times \left( \frac{1}{\sqrt{2.6}} \right)^4 \times 3 + 0.2 \times \left( \frac{\sqrt{9}}{\sqrt{2.6}} \right)^4 \times 3 = 7.54$$

The kurtosis of this Gaussian mixture is higher than 3.
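The moments in Example 4.24 follow from (4.82) and (4.84) and are easy to recompute. The sketch below is an illustration, not the quantlet MVAmixture; the function name `zero_mean_gaussian_mixture_moments` is hypothetical, and it takes component variances (not standard deviations) as input.

```python
def zero_mean_gaussian_mixture_moments(weights, variances):
    """Variance (4.82) and kurtosis (4.84) of a mixture of zero-mean Gaussians.

    (sigma_l / sigma)^4 equals (variance_l / variance)^2, so only variances are needed.
    """
    var = sum(w * v for w, v in zip(weights, variances))
    kurt = sum(w * (v / var) ** 2 * 3 for w, v in zip(weights, variances))
    return var, kurt


var, kurt = zero_mean_gaussian_mixture_moments([0.8, 0.2], [1.0, 9.0])
print(var, kurt)
```

With a single component the function returns kurtosis 3, the Gaussian value, so any excess above 3 comes from mixing components with different variances.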


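A summary of the basic statistics is given in Tables 4.2 and 4.3.

Table 4.3 Basic statistics of GH distribution and mixture model

GH
  Mean:      $\mu + \dfrac{\delta\beta}{\sqrt{\alpha^2 - \beta^2}}\, \dfrac{K_{\lambda+1}(\delta\sqrt{\alpha^2 - \beta^2})}{K_\lambda(\delta\sqrt{\alpha^2 - \beta^2})}$
  Variance:  $\delta^2 \left[ \dfrac{K_{\lambda+1}(\delta\sqrt{\alpha^2 - \beta^2})}{\delta\sqrt{\alpha^2 - \beta^2}\, K_\lambda(\delta\sqrt{\alpha^2 - \beta^2})} + \dfrac{\beta^2}{\alpha^2 - \beta^2} \left\{ \dfrac{K_{\lambda+2}(\delta\sqrt{\alpha^2 - \beta^2})}{K_\lambda(\delta\sqrt{\alpha^2 - \beta^2})} - \left( \dfrac{K_{\lambda+1}(\delta\sqrt{\alpha^2 - \beta^2})}{K_\lambda(\delta\sqrt{\alpha^2 - \beta^2})} \right)^2 \right\} \right]$

Mixture
  Mean:      $\sum_{l=1}^{L} w_l \mu_l$
  Variance:  $\sum_{l=1}^{L} w_l \{\sigma_l^2 + (\mu_l - \mu)^2\}$
  Skewness:  $\sum_{l=1}^{L} w_l \left\{ \left( \dfrac{\sigma_l}{\sigma} \right)^3 SK_l + \dfrac{3\sigma_l^2(\mu_l - \mu)}{\sigma^3} + \left( \dfrac{\mu_l - \mu}{\sigma} \right)^3 \right\}$
  Kurtosis:  $\sum_{l=1}^{L} w_l \left\{ \left( \dfrac{\sigma_l}{\sigma} \right)^4 K_l + \dfrac{6(\mu_l - \mu)^2\sigma_l^2}{\sigma^4} + \dfrac{4(\mu_l - \mu)\sigma_l^3}{\sigma^4}\, SK_l + \left( \dfrac{\mu_l - \mu}{\sigma} \right)^4 \right\}$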

Multivariate  Generalised  Hyperbolic  Distribution 


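The multivariate Generalised Hyperbolic Distribution ($GH_d$) has the following pdf

$$f_{GH_d}(x; \lambda, \alpha, \beta, \delta, \Delta, \mu) = a_d\,
\frac{K_{\lambda - \frac{d}{2}}\left\{ \alpha\sqrt{\delta^2 + (x - \mu)^\top \Delta^{-1} (x - \mu)} \right\}}
{\left\{ \alpha^{-1}\sqrt{\delta^2 + (x - \mu)^\top \Delta^{-1} (x - \mu)} \right\}^{\frac{d}{2} - \lambda}}\,
e^{\beta^\top (x - \mu)} \tag{4.85}$$

$$a_d = a_d(\lambda, \alpha, \beta, \delta, \Delta)
= \frac{\left( \sqrt{\alpha^2 - \beta^\top \Delta \beta}/\delta \right)^{\lambda}}
{(2\pi)^{\frac{d}{2}}\, K_\lambda\left( \delta\sqrt{\alpha^2 - \beta^\top \Delta \beta} \right)} \tag{4.86}$$

and characteristic function

$$\phi(t) = \left( \frac{\alpha^2 - \beta^\top \Delta \beta}{\alpha^2 - \beta^\top \Delta \beta + \frac{1}{2} t^\top \Delta t - i\beta^\top \Delta t} \right)^{\frac{\lambda}{2}}
\frac{K_\lambda\left( \delta\sqrt{\alpha^2 - \beta^\top \Delta \beta + \frac{1}{2} t^\top \Delta t - i\beta^\top \Delta t} \right)}
{K_\lambda\left( \delta\sqrt{\alpha^2 - \beta^\top \Delta \beta} \right)} \tag{4.87}$$

These parameters have the following domain of variation:

$$\lambda \in \mathbb{R}, \quad \beta, \mu \in \mathbb{R}^d$$
$$\delta > 0, \quad \alpha^2 > \beta^\top \Delta \beta$$
$$\Delta \in \mathbb{R}^{d \times d} \ \text{positive definite matrix}$$

For $\lambda = \frac{d+1}{2}$ we obtain the multivariate hyperbolic (HYP) distribution; for $\lambda = -\frac{1}{2}$
we get the multivariate normal inverse Gaussian (NIG) distribution.

Blaesild and Jensen (1981) introduced a second parameterisation $(\zeta, \Pi, \Sigma)$, where

$$\zeta = \delta\sqrt{\alpha^2 - \beta^\top \Delta \beta} \tag{4.88}$$

$$\Pi = \beta\sqrt{\frac{\Delta}{\alpha^2 - \beta^\top \Delta \beta}} \tag{4.89}$$

$$\Sigma = \delta^2 \Delta \tag{4.90}$$

The mean and variance of $X \sim GH_d$ are

$$E[X] = \mu + \delta R_\lambda(\zeta)\, \Pi \Delta^{\frac{1}{2}} \tag{4.91}$$

$$\mathrm{Var}[X] = \delta^2 \left\{ \zeta^{-1} R_\lambda(\zeta)\, \Delta + S_\lambda(\zeta) (\Pi \Delta^{\frac{1}{2}})^\top (\Pi \Delta^{\frac{1}{2}}) \right\} \tag{4.92}$$

where

$$R_\lambda(x) = \frac{K_{\lambda+1}(x)}{K_\lambda(x)} \tag{4.93}$$

$$S_\lambda(x) = \frac{K_{\lambda+2}(x) K_\lambda(x) - K_{\lambda+1}^2(x)}{K_\lambda^2(x)} \tag{4.94}$$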

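Theorem 4.12 Suppose that $X$ is a $d$-dimensional variate distributed according to
the generalised hyperbolic distribution $GH_d$. Let $(X_1, X_2)$ be a partitioning of $X$,
let $r$ and $k$ denote the dimensions of $X_1$ and $X_2$, respectively, and let $(\beta_1, \beta_2)$ and
$(\mu_1, \mu_2)$ be similar partitions of $\beta$ and $\mu$, let

$$\Delta = \begin{pmatrix} \Delta_{11} & \Delta_{12} \\ \Delta_{21} & \Delta_{22} \end{pmatrix} \tag{4.95}$$

be a partition of $\Delta$ such that $\Delta_{11}$ is an $r \times r$ matrix. Then one has the following:

1. The distribution of $X_1$ is the $r$-dimensional generalised hyperbolic distribution,
$GH_r(\lambda^*, \alpha^*, \beta^*, \delta^*, \mu^*, \Delta^*)$, where

$$\lambda^* = \lambda$$
$$\alpha^* = |\Delta_{11}|^{-\frac{1}{2r}} \{\alpha^2 - \beta_2(\Delta_{22} - \Delta_{21}\Delta_{11}^{-1}\Delta_{12})\beta_2^\top\}^{\frac{1}{2}}$$
$$\beta^* = \beta_1 + \beta_2 \Delta_{21} \Delta_{11}^{-1}$$
$$\delta^* = \delta |\Delta_{11}|^{\frac{1}{2r}}$$
$$\mu^* = \mu_1$$
$$\Delta^* = |\Delta_{11}|^{-\frac{1}{r}} \Delta_{11}$$

2. The conditional distribution of $X_2$ given $X_1 = x_1$ is the $k$-dimensional
generalised hyperbolic distribution $GH_k(\tilde{\lambda}, \tilde{\alpha}, \tilde{\beta}, \tilde{\delta}, \tilde{\mu}, \tilde{\Delta})$, where

$$\tilde{\lambda} = \lambda - \frac{r}{2}$$
$$\tilde{\alpha} = \alpha |\Delta_{11}|^{\frac{1}{2k}}$$
$$\tilde{\beta} = \beta_2$$
$$\tilde{\delta} = |\Delta_{11}|^{-\frac{1}{2k}} \{\delta^2 + (x_1 - \mu_1)\Delta_{11}^{-1}(x_1 - \mu_1)^\top\}^{\frac{1}{2}}$$
$$\tilde{\mu} = \mu_2 + (x_1 - \mu_1)\Delta_{11}^{-1}\Delta_{12}$$
$$\tilde{\Delta} = |\Delta_{11}|^{\frac{1}{k}} (\Delta_{22} - \Delta_{21}\Delta_{11}^{-1}\Delta_{12})$$

3. Let $Y = X\mathcal{A} + B$ be a regular affine transformation of $X$ and let
$\|\mathcal{A}\|$ denote the absolute value of the determinant of $\mathcal{A}$. The
distribution of $Y$ is the $d$-dimensional generalised hyperbolic distribution
$GH_d(\lambda^+, \alpha^+, \beta^+, \delta^+, \mu^+, \Delta^+)$, where

$$\lambda^+ = \lambda$$
$$\alpha^+ = \alpha \|\mathcal{A}\|^{-\frac{1}{d}}$$
$$\beta^+ = \beta (\mathcal{A}^{-1})^\top$$
$$\delta^+ = \delta \|\mathcal{A}\|^{\frac{1}{d}}$$
$$\mu^+ = \mu \mathcal{A} + B$$
$$\Delta^+ = \|\mathcal{A}\|^{-\frac{2}{d}} \mathcal{A}^\top \Delta \mathcal{A}$$

Multivariate t-Distribution

If $X$ and $Y$ are independent and distributed as $N_p(\mu, \Sigma)$ and $\chi_n^2$ respectively, and
$X\sqrt{n/Y} = t - \mu$, then the pdf of $t$ is given by

$$f_t(t; n, \Sigma, \mu) = \frac{\Gamma\{(n + p)/2\}}
{\Gamma(n/2)\, n^{p/2} \pi^{p/2} |\Sigma|^{1/2} \left\{ 1 + \frac{1}{n}(t - \mu)^\top \Sigma^{-1} (t - \mu) \right\}^{(n+p)/2}} \tag{4.96}$$

The distribution of $t$ is the noncentral t-distribution with $n$ degrees of freedom and
the noncentrality parameter $\mu$, Giri (1996).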


Multivariate  Laplace  Distribution 


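Let $g$ and $G$ be the pdf and cdf of a $d$-dimensional Gaussian distribution $N_d(0, \Sigma)$;
the pdf and cdf of a multivariate Laplace distribution can be written as

$$f_{MLaplace_d}(x; m, \Sigma) = \int_0^\infty g(z^{-\frac{1}{2}} x - z^{\frac{1}{2}} m)\, z^{-\frac{d}{2}} e^{-z}\, dz \tag{4.97}$$

$$F_{MLaplace_d}(x; m, \Sigma) = \int_0^\infty G(z^{-\frac{1}{2}} x - z^{\frac{1}{2}} m)\, e^{-z}\, dz \tag{4.98}$$

The pdf can also be described as

$$f_{MLaplace_d}(x; m, \Sigma) = \frac{2 e^{x^\top \Sigma^{-1} m}}{(2\pi)^{\frac{d}{2}} |\Sigma|^{\frac{1}{2}}}
\left( \frac{x^\top \Sigma^{-1} x}{2 + m^\top \Sigma^{-1} m} \right)^{\frac{\lambda}{2}}
K_\lambda\left( \sqrt{(2 + m^\top \Sigma^{-1} m)(x^\top \Sigma^{-1} x)} \right) \tag{4.99}$$

where $\lambda = \frac{2 - d}{2}$ and $K_\lambda(x)$ is the modified Bessel function of the third kind

$$K_\lambda(x) = \frac{1}{2} \left( \frac{x}{2} \right)^{\lambda} \int_0^\infty t^{-\lambda - 1} e^{-t - \frac{x^2}{4t}}\, dt, \quad x > 0 \tag{4.100}$$

The multivariate Laplace distribution has mean and variance

$$E[X] = m \tag{4.101}$$

$$\mathrm{Cov}[X] = \Sigma + m m^\top \tag{4.102}$$

Fig. 4.13 Tail comparison of t-distribution Q MVAtdistail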

Multivariate  Mixture  Model 


A multivariate mixture model comprises multivariate distributions, e.g. the pdf of a
multivariate Gaussian mixture can be written as

$$f(x) = \sum_{l=1}^{L} \frac{w_l}{|2\pi\Sigma_l|^{\frac{1}{2}}}\, e^{-\frac{1}{2}(x - \mu_l)^\top \Sigma_l^{-1} (x - \mu_l)} \tag{4.103}$$


Generalised  Hyperbolic  Distribution 

The GH distribution has an exponential decaying speed

$$f_{GH}(x; \lambda, \alpha, \beta, \delta, \mu = 0) \sim x^{\lambda - 1} e^{-(\alpha - \beta)x} \quad \text{as } x \to \infty. \tag{4.104}$$

As a comparison to the tail behaviour of the t-distribution depicted in Fig. 4.13,
Fig. 4.14 illustrates the tail behaviour of GH distributions with different values of
$\lambda$ with $\alpha = 1$, $\beta = 0$, $\delta = 1$, $\mu = 0$. It is clear that among the four distributions,
GH with $\lambda = 1.5$ has the lowest decaying speed, while NIG decays fastest.

In Fig. 4.15, Chen, Härdle, and Jeong (2008), four distributions and especially
their tail behaviour are compared. In order to keep the comparability of these
distributions, we specified the means to 0 and standardised the variances to 1.
Furthermore we used one important subclass of the GH distribution: the NIG
distribution with $\lambda = -\frac{1}{2}$ introduced above. On the left panel, the complete forms
of these distributions are revealed. The Cauchy (dots) distribution has the lowest
peak and the fattest tails. In other words, it has the flattest distribution. The NIG
distribution decays second fastest in the tails although it has the highest peak, which is
more clearly displayed on the right panel.

Fig. 4.14 Tail comparison of GH distribution (pdf) Q MVAghdistail

Fig. 4.15 Graphical comparison of the NIG distribution (line), standard normal distribution Q
MVAghadatail
[Panels: "Distribution comparison" (left) and "Tail comparison" (right)]

4.7 Copulae

The cumulative distribution function (cdf) of a two-dimensional vector $(X_1, X_2)$ is
given by

$$F(x_1, x_2) = P(X_1 \le x_1, X_2 \le x_2). \tag{4.105}$$

For the case that $X_1$ and $X_2$ are independent, their joint cumulative distribution
function $F(x_1, x_2)$ can be written as a product of their one-dimensional marginals:

$$F(x_1, x_2) = F_{X_1}(x_1)\, F_{X_2}(x_2) = P(X_1 \le x_1)\, P(X_2 \le x_2). \tag{4.106}$$

But how can we model dependence of $X_1$ and $X_2$? Most people would suggest
linear correlation. Correlation, however, is an appropriate measure of dependence only
when the random variables have an elliptical or spherical distribution, which includes
the multivariate normal distribution. Although the terms "correlation" and
"dependency" are often used interchangeably, correlation is actually a rather imperfect
measure of dependency, and there are many circumstances where correlation should
not be used.

Copulae represent an elegant concept of connecting marginals with joint
cumulative distribution functions. Copulae are functions that join or "couple" multivariate
distribution functions to their one-dimensional marginal distribution functions. Let
us consider a $d$-dimensional vector $X = (X_1, \ldots, X_d)^\top$. Using copulae, the
marginal distribution functions $F_{X_i}$ $(i = 1, \ldots, d)$ can be separately modelled
from their dependence structure and then coupled together to form the multivariate
distribution $F_X$. Copula functions have a long history in probability theory and
statistics. Their application in finance is very recent. Copulae are important in
Value-at-Risk calculations and constitute an essential tool in quantitative finance (Härdle
et al., 2009).

First let us concentrate on the two-dimensional case, then we will extend this
concept to the $d$-dimensional case, for a random variable in $\mathbb{R}^d$ with $d \ge 1$. To be
able to define a copula function, first we need to introduce the concepts of the volume
of a rectangle, a 2-increasing function and a grounded function.

Let $U_1$ and $U_2$ be two sets in $\overline{\mathbb{R}} = \mathbb{R} \cup \{+\infty\} \cup \{-\infty\}$ and consider the function
$F : U_1 \times U_2 \to \overline{\mathbb{R}}$.


Definition 4.2 The $F$-volume of a rectangle $B = [x_1, x_2] \times [y_1, y_2] \subset U_1 \times U_2$ is defined as:

$$V_F(B) = F(x_2, y_2) - F(x_1, y_2) - F(x_2, y_1) + F(x_1, y_1) \tag{4.107}$$

Definition 4.3 $F$ is said to be a 2-increasing function if for every $B = [x_1, x_2] \times [y_1, y_2] \subset U_1 \times U_2$,

$$V_F(B) \ge 0 \tag{4.108}$$

Remark 4.2 Note that “to be a 2-increasing function” neither implies nor is implied by “to be increasing in each argument”.
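Remark 4.2 can be checked numerically. The following is a minimal sketch, not part of the original quantlets: the helper names `f_volume` and `is_two_increasing` and the two test functions are illustrative choices. $\max(x, y)$ is non-decreasing in each argument but not 2-increasing, while $(2x - 1)(2y - 1)$ is 2-increasing but decreasing in $x$ whenever $y < 1/2$.

```python
def f_volume(F, x1, x2, y1, y2):
    """F-volume of the rectangle [x1, x2] x [y1, y2], cf. (4.107)."""
    return F(x2, y2) - F(x1, y2) - F(x2, y1) + F(x1, y1)


def is_two_increasing(F, grid, eps=1e-12):
    """Check V_F(B) >= 0 for every rectangle with corners on `grid`.

    eps absorbs floating-point rounding; this is a grid check, not a proof.
    """
    return all(
        f_volume(F, x1, x2, y1, y2) >= -eps
        for i, x1 in enumerate(grid) for x2 in grid[i:]
        for j, y1 in enumerate(grid) for y2 in grid[j:]
    )


grid = [k / 10 for k in range(11)]            # grid on [0, 1]

# Non-decreasing in each argument, but not 2-increasing:
F_max = lambda x, y: max(x, y)
# 2-increasing, but decreasing in x whenever y < 1/2:
F_prod = lambda x, y: (2 * x - 1) * (2 * y - 1)

print(f_volume(F_max, 0, 1, 0, 1))            # volume of the unit square is negative
print(is_two_increasing(F_max, grid))         # False
print(is_two_increasing(F_prod, grid))        # True
print(F_prod(0.0, 0.2), F_prod(1.0, 0.2))     # the value falls as x increases
```

The second function has $V_F(B) = 4(x_2 - x_1)(y_2 - y_1) \ge 0$ for every rectangle, which is why the grid check succeeds.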

The  following  lemmas  (Nelsen,  1999)  will  be  very  useful  later  for  establishing 
the  continuity  of  copulae. 

Lemma 4.1 Let $U_1$ and $U_2$ be non-empty sets in $\overline{\mathbb{R}}$ and let $F : U_1 \times U_2 \to \overline{\mathbb{R}}$ be a 2-increasing function. Let $x_1, x_2$ be in $U_1$ with $x_1 \le x_2$, and $y_1, y_2$ be in $U_2$ with $y_1 \le y_2$. Then the function $t \mapsto F(t, y_2) - F(t, y_1)$ is non-decreasing on $U_1$ and the function $t \mapsto F(x_2, t) - F(x_1, t)$ is non-decreasing on $U_2$.

Definition 4.4 If $U_1$ and $U_2$ have smallest elements $\min U_1$ and $\min U_2$ respectively, then we say that a function $F : U_1 \times U_2 \to \overline{\mathbb{R}}$ is grounded if:

$$\text{for all } x \in U_1: \quad F(x, \min U_2) = 0 \quad \text{and} \tag{4.109}$$

$$\text{for all } y \in U_2: \quad F(\min U_1, y) = 0 \tag{4.110}$$

In the following, we will refer to this definition of a cdf.

Definition 4.5 A cdf is a function from $\overline{\mathbb{R}}^2$ to $[0, 1]$ which

(i) is grounded

(ii) is 2-increasing

(iii) satisfies $F(\infty, \infty) = 1$

Lemma 4.2 Let $U_1$ and $U_2$ be non-empty sets in $\overline{\mathbb{R}}$ and let $F : U_1 \times U_2 \to \overline{\mathbb{R}}$ be a grounded 2-increasing function. Then $F$ is non-decreasing in each argument.

Definition 4.6 If $U_1$ and $U_2$ have greatest elements $\max U_1$ and $\max U_2$ respectively, then we say that a function $F : U_1 \times U_2 \to \overline{\mathbb{R}}$ has margins and that the margins of $F$ are given by:

$$F(x) = F(x, \max U_2) \quad \text{for all } x \in U_1 \tag{4.111}$$

$$F(y) = F(\max U_1, y) \quad \text{for all } y \in U_2 \tag{4.112}$$

Lemma 4.3 Let $U_1$ and $U_2$ be non-empty sets in $\overline{\mathbb{R}}$ and let $F : U_1 \times U_2 \to \overline{\mathbb{R}}$ be a grounded 2-increasing function which has margins. Let $(x_1, y_1), (x_2, y_2) \in U_1 \times U_2$. Then

$$|F(x_2, y_2) - F(x_1, y_1)| \le |F(x_2) - F(x_1)| + |F(y_2) - F(y_1)| \tag{4.113}$$

Definition 4.7 A two-dimensional copula is a function $C$ defined on the unit square $I^2 = I \times I$ with $I = [0, 1]$ such that

(i) for every $u, v \in I$ holds: $C(u, 0) = C(0, v) = 0$, i.e. $C$ is grounded.

(ii) for every $u_1, u_2, v_1, v_2 \in I$ with $u_1 \le u_2$ and $v_1 \le v_2$ holds:

$$C(u_2, v_2) - C(u_2, v_1) - C(u_1, v_2) + C(u_1, v_1) \ge 0, \tag{4.114}$$

i.e. $C$ is 2-increasing.

(iii) for every $u, v \in I$ holds: $C(u, 1) = u$ and $C(1, v) = v$.

Informally, a copula is a joint distribution function defined on the unit square $[0, 1]^2$ which has uniform marginals. That means that if $F_{X_1}(x_1)$ and $F_{X_2}(x_2)$ are univariate distribution functions, then $C\{F_{X_1}(x_1), F_{X_2}(x_2)\}$ is a two-dimensional distribution function with marginals $F_{X_1}(x_1)$ and $F_{X_2}(x_2)$.

Example 4.25 The functions $\max(u + v - 1, 0)$, $uv$, $\min(u, v)$ can be easily checked to be copula functions. They are called respectively the minimum, product and maximum copula.
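The three conditions of a two-dimensional copula (grounded, 2-increasing, uniform margins) can be verified on a grid. The sketch below is illustrative only: `looks_like_copula` is a hypothetical helper, and the counterexample `not_copula` (a Farlie-Gumbel-Morgenstern-type function with a parameter outside its admissible range) is an assumption added here, not taken from the text.

```python
W = lambda u, v: max(u + v - 1.0, 0.0)    # max(u + v - 1, 0)
Pi = lambda u, v: u * v                   # product copula
M = lambda u, v: min(u, v)                # min(u, v)

# Grounded with uniform margins, but not 2-increasing (illustrative counterexample)
not_copula = lambda u, v: u * v * (1.0 + 3.0 * (1.0 - u) * (1.0 - v))


def looks_like_copula(C, m=20, eps=1e-12):
    """Check the three copula conditions on the grid {0, 1/m, ..., 1}."""
    g = [k / m for k in range(m + 1)]
    grounded = all(abs(C(u, 0.0)) <= eps and abs(C(0.0, u)) <= eps for u in g)
    margins = all(abs(C(u, 1.0) - u) <= eps and abs(C(1.0, u) - u) <= eps for u in g)
    # Neighbouring grid cells suffice: the volume of a larger grid rectangle
    # is the sum of the volumes of the cells it contains.
    two_inc = all(
        C(g[i + 1], g[j + 1]) - C(g[i + 1], g[j]) - C(g[i], g[j + 1]) + C(g[i], g[j]) >= -eps
        for i in range(m) for j in range(m)
    )
    return grounded and margins and two_inc


for name, C in [("W", W), ("Pi", Pi), ("M", M), ("not_copula", not_copula)]:
    print(name, looks_like_copula(C))
```

A grid check of this kind can only reject a candidate function; passing it is necessary, not sufficient, for being a copula.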

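Example 4.26 Consider the function

$$C_\rho^{Gauss}(u, v) = \Phi_\rho\{\Phi_1^{-1}(u), \Phi_2^{-1}(v)\} = \int_{-\infty}^{\Phi_1^{-1}(u)} \int_{-\infty}^{\Phi_2^{-1}(v)} f_\rho(x_1, x_2)\, dx_2\, dx_1 \tag{4.115}$$

where $\Phi_\rho$ is the joint two-dimensional standard normal distribution function with correlation coefficient $\rho$, while $\Phi_1$ and $\Phi_2$ refer to standard normal cdfs and

$$f_\rho(x_1, x_2) = \frac{1}{2\pi\sqrt{1 - \rho^2}} \exp\left\{-\frac{x_1^2 - 2\rho x_1 x_2 + x_2^2}{2(1 - \rho^2)}\right\} \tag{4.116}$$

denotes the bivariate normal pdf.

It is easy to see that $C^{Gauss}$ is a copula, the so-called Gaussian or normal copula, since it is 2-increasing and

$$\Phi_\rho\{\Phi^{-1}(u), \Phi^{-1}(0)\} = \Phi_\rho\{\Phi^{-1}(0), \Phi^{-1}(v)\} = 0 \tag{4.117}$$

$$\Phi_\rho\{\Phi^{-1}(u), \Phi^{-1}(1)\} = u \quad \text{and} \quad \Phi_\rho\{\Phi^{-1}(1), \Phi^{-1}(v)\} = v \tag{4.118}$$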
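The Gaussian copula can be evaluated numerically with the Python standard library alone. This is a sketch under stated assumptions rather than the book's implementation: the function name `gauss_copula` is hypothetical, and the double integral is reduced to the one-dimensional form $C(u, v) = \int_0^u \Phi\{(\Phi^{-1}(v) - \rho\,\Phi^{-1}(t))/\sqrt{1 - \rho^2}\}\,dt$, obtained by integrating out $x_2$ and substituting $t = \Phi(x_1)$, then approximated with the midpoint rule.

```python
from math import asin, pi, sqrt
from statistics import NormalDist

PHI = NormalDist()  # standard normal: PHI.cdf is Phi, PHI.inv_cdf is its inverse


def gauss_copula(u, v, rho, n=4000):
    """Approximate the Gaussian copula at (u, v) for 0 < u, v < 1 and |rho| < 1.

    Midpoint rule on n subintervals of [0, u] applied to
    Phi((Phi^{-1}(v) - rho * Phi^{-1}(t)) / sqrt(1 - rho^2)).
    """
    b = PHI.inv_cdf(v)
    s = sqrt(1.0 - rho * rho)
    h = u / n
    total = 0.0
    for k in range(n):
        t = (k + 0.5) * h                      # midpoint, strictly inside (0, u)
        total += PHI.cdf((b - rho * PHI.inv_cdf(t)) / s)
    return total * h


print(gauss_copula(0.3, 0.7, 0.0))             # rho = 0: close to u * v = 0.21
print(gauss_copula(0.5, 0.5, 0.5))             # close to 1/4 + asin(0.5) / (2 * pi)
print(0.25 + asin(0.5) / (2 * pi))
```

For $u = v = 1/2$ the value can be compared with the closed-form orthant probability $1/4 + \arcsin(\rho)/(2\pi)$ of the bivariate normal distribution, which gives a quick sanity check of the quadrature.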
Fig. 4.16 Surface plot of the Gumbel-Hougaard copula, $\theta = 3$ Q MVAghsurface

A simple and useful way to represent the graph of a copula is the contour diagram, that is, graphs of its level sets: the sets in $I^2$ given by $C(u, v) =$ a constant. In Figs. 4.16 and 4.17 we present the surface plot and the contour diagrams of the Gumbel-Hougaard copula (Example 4.4) for different values of the copula parameter $\theta$.

For $\theta = 1$ the Gumbel-Hougaard copula reduces to the product copula, i.e.

$$C_1(u, v) = \Pi(u, v) = uv \tag{4.119}$$

For $\theta \to \infty$, one finds for the Gumbel-Hougaard copula:

$$C_\theta(u, v) \to \min(u, v) = M(u, v) \tag{4.120}$$

where $M$ is also a copula such that $C(u, v) \le M(u, v)$ for an arbitrary copula $C$. The copula $M$ is called the Fréchet-Hoeffding upper bound.

The two-dimensional function $W(u, v) = \max(u + v - 1, 0)$ defines a copula with $W(u, v) \le C(u, v)$ for any other copula $C$. $W$ is called the Fréchet-Hoeffding lower bound.

In Fig. 4.18 we show an example of Gumbel-Hougaard copula sampling for fixed parameters $\sigma_1 = 1$, $\sigma_2 = 1$ and $\theta = 3$.

One can demonstrate the so-called Fréchet-Hoeffding inequality, which we have already used in Example 1.3, and which states that each copula function is bounded by the minimum and maximum one:

$$W(u, v) = \max(u + v - 1, 0) \le C(u, v) \le \min(u, v) = M(u, v) \tag{4.121}$$
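The two limiting cases and the Fréchet-Hoeffding bounds can be illustrated with a few lines of code. This is a minimal sketch assuming the standard bivariate form of the Gumbel-Hougaard copula, $C_\theta(u, v) = \exp[-\{(-\log u)^\theta + (-\log v)^\theta\}^{1/\theta}]$ with $\theta \ge 1$; the function name `gumbel_hougaard` is a hypothetical choice.

```python
from math import exp, log


def gumbel_hougaard(u, v, theta):
    """Bivariate Gumbel-Hougaard copula for 0 < u, v < 1 and theta >= 1."""
    s = (-log(u)) ** theta + (-log(v)) ** theta
    return exp(-(s ** (1.0 / theta)))


u, v = 0.3, 0.7
print(gumbel_hougaard(u, v, 1.0), u * v)         # theta = 1: product copula
print(gumbel_hougaard(u, v, 50.0), min(u, v))    # large theta: close to min(u, v)

# Frechet-Hoeffding inequality on interior grid points for theta = 3
grid = [k / 10 for k in range(1, 10)]
ok = all(
    max(a + b - 1.0, 0.0) - 1e-12 <= gumbel_hougaard(a, b, 3.0) <= min(a, b) + 1e-12
    for a in grid for b in grid
)
print(ok)
```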
Fig. 4.17 Contour plots of the Gumbel-Hougaard copula Q MVAghcontour

The full relationship between copula and joint cdf depends on Sklar's theorem.

Example 4.27 Let us verify that the Gaussian copula satisfies Sklar’s theorem in both directions. On the one side, let

$$F(x_1, x_2) = \int_{-\infty}^{x_1} \int_{-\infty}^{x_2} \frac{1}{2\pi\sqrt{1 - \rho^2}} \exp\left\{-\frac{u_1^2 - 2\rho u_1 u_2 + u_2^2}{2(1 - \rho^2)}\right\} du_2\, du_1 \tag{4.122}$$

be a two-dimensional normal distribution function with standard normal cdf’s $F_{X_1}(x_1)$ and $F_{X_2}(x_2)$. Since $F_{X_1}(x_1)$ and $F_{X_2}(x_2)$ are continuous, a unique copula $C$ exists such that for all $x_1, x_2 \in \overline{\mathbb{R}}$ a two-dimensional distribution function can be written as a copula in $F_{X_1}(x_1)$ and $F_{X_2}(x_2)$:

$$F(x_1, x_2) = C\{\Phi_{X_1}(x_1), \Phi_{X_2}(x_2)\} \tag{4.123}$$

Fig. 4.18 10,000-sample output for $\sigma_1 = 1$, $\sigma_2 = 1$, $\theta = 3$ Q MVAsample1000

The Gaussian copula satisfies the above equality, therefore it is the unique copula mentioned in Sklar’s theorem. This proves that the Gaussian copula, together with Gaussian marginals, gives the two-dimensional normal distribution.

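Conversely, if $C$ is the Gaussian copula and $F_{X_1}$ and $F_{X_2}$ are standard normal distribution functions, then

$$C\{F_{X_1}(x_1), F_{X_2}(x_2)\} = \int_{-\infty}^{\Phi_1^{-1}\{F_{X_1}(x_1)\}} \int_{-\infty}^{\Phi_2^{-1}\{F_{X_2}(x_2)\}} \frac{1}{2\pi\sqrt{1 - \rho^2}} \exp\left\{-\frac{x_1^2 - 2\rho x_1 x_2 + x_2^2}{2(1 - \rho^2)}\right\} dx_2\, dx_1 \tag{4.124}$$

is evidently a joint (two-dimensional) distribution function. Its margins are

$$C\{F_{X_1}(x_1), F_{X_2}(+\infty)\} = \Phi_\rho[\Phi^{-1}\{F_{X_1}(x_1)\}, +\infty] = F_{X_1}(x_1) \tag{4.125}$$

$$C\{F_{X_1}(+\infty), F_{X_2}(x_2)\} = \Phi_\rho[+\infty, \Phi^{-1}\{F_{X_2}(x_2)\}] = F_{X_2}(x_2) \tag{4.126}$$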


The following proposition shows one attractive feature of the copula representation of dependence, i.e. that the dependence structure described by a copula is invariant under increasing and continuous transformations of the marginal distributions.

Theorem 4.13 If $(X_1, X_2)$ have copula $C$ and $g_1, g_2$ are two continuously increasing functions, then $\{g_1(X_1), g_2(X_2)\}$ have the copula $C$, too.

Example 4.28 Independence implies that the product of the cdf’s $F_{X_1}$ and $F_{X_2}$ equals the joint distribution function $F$, i.e.:

$$F(x_1, x_2) = F_{X_1}(x_1) F_{X_2}(x_2). \tag{4.127}$$

Thus, we obtain the independence or product copula $C = \Pi(u, v) = uv$.

While  it  is  easily  understood  how  a  product  copula  describes  an  independence 
relationship,  the  converse  is  also  true.  Namely,  the  joint  distribution  function  of  two 
independent  random  variables  can  be  interpreted  as  a  product  copula.  This  concept 
is  formalised  in  the  following  theorem: 

Theorem 4.14 Let $X_1$ and $X_2$ be random variables with continuous distribution functions $F_{X_1}$ and $F_{X_2}$ and the joint distribution function $F$. Then $X_1$ and $X_2$ are independent if and only if $C_{X_1 X_2} = \Pi$.

Example 4.29 Let us consider the Gaussian copula for the case $\rho = 0$, i.e. vanishing correlation. In this case the Gaussian copula becomes

$$C_0^{Gauss}(u, v) = \int_{-\infty}^{\Phi_1^{-1}(u)} \varphi(x_1)\, dx_1 \int_{-\infty}^{\Phi_2^{-1}(v)} \varphi(x_2)\, dx_2 = uv = \Pi(u, v). \tag{4.128}$$

The following theorem, which follows directly from Lemma 4.3, establishes the continuity of copulae.

Theorem 4.15 Let $C$ be a copula. Then for any $u_1, v_1, u_2, v_2 \in I$ holds

$$|C(u_2, v_2) - C(u_1, v_1)| \le |u_2 - u_1| + |v_2 - v_1| \tag{4.129}$$

From  (4.129)  it  follows  that  every  copula  C  is  uniformly  continuous  on  its 
domain. 

A  further  important  property  of  copulae  concerns  the  partial  derivatives  of  a 
copula  with  respect  to  its  variables: 

Theorem 4.16 Let $C(u, v)$ be a copula. For any $u \in I$, the partial derivative $\partial C(u, v)/\partial v$ exists for almost all $v \in I$. For such $u$ and $v$ one has:

$$\frac{\partial C(u, v)}{\partial v} \in I \tag{4.130}$$

The analogous statement is true for the partial derivative $\partial C(u, v)/\partial u$:

$$\frac{\partial C(u, v)}{\partial u} \in I \tag{4.131}$$

Moreover, the functions

$$u \mapsto C_v(u) = \partial C(u, v)/\partial v \quad \text{and} \quad v \mapsto C_u(v) = \partial C(u, v)/\partial u$$

are defined and non-decreasing almost everywhere on $I$.

Until now, we have considered copulae only in a two-dimensional setting. Let us now extend this concept to the $d$-dimensional case, for a random variable in $\mathbb{R}^d$ with $d > 1$.

Let $U_1, U_2, \ldots, U_d$ be non-empty sets in $\overline{\mathbb{R}}$ and consider the function $F : U_1 \times U_2 \times \cdots \times U_d \to \overline{\mathbb{R}}$. For $a = (a_1, a_2, \ldots, a_d)$ and $b = (b_1, b_2, \ldots, b_d)$ with $a \le b$ (i.e. $a_k \le b_k$ for all $k$) let $B = [a, b] = [a_1, b_1] \times [a_2, b_2] \times \cdots \times [a_d, b_d]$ be the $d$-box with vertices $c = (c_1, c_2, \ldots, c_d)$. It is obvious that each $c_k$ is either equal to $a_k$ or to $b_k$.

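Definition 4.8 The $F$-volume of a $d$-box $B = [a, b] = [a_1, b_1] \times [a_2, b_2] \times \cdots \times [a_d, b_d] \subset U_1 \times U_2 \times \cdots \times U_d$ is defined as follows:

$$V_F(B) = \sum_{c} \operatorname{sign}(c) F(c) \tag{4.132}$$

where the sum runs over all $2^d$ vertices $c$ of $B$, and $\operatorname{sign}(c) = 1$ if $c_k = a_k$ for an even number of $k$'s and $\operatorname{sign}(c) = -1$ if $c_k = a_k$ for an odd number of $k$'s.

Example 4.30 For the case $d = 3$, the $F$-volume of a 3-box $B = [a, b] = [x_1, x_2] \times [y_1, y_2] \times [z_1, z_2]$ is defined as:

$$V_F(B) = F(x_2, y_2, z_2) - F(x_2, y_2, z_1) - F(x_2, y_1, z_2) - F(x_1, y_2, z_2) + F(x_2, y_1, z_1) + F(x_1, y_2, z_1) + F(x_1, y_1, z_2) - F(x_1, y_1, z_1)$$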
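The signed sum over the vertices of a $d$-box translates directly into code. The following is a sketch with hypothetical names (`f_volume`, `explicit_3d`); it reads the sign rule as "plus if an even number of coordinates sit at the lower corner", which is the reading that reproduces the eight terms of the three-dimensional example.

```python
from itertools import product


def f_volume(F, a, b):
    """F-volume of the d-box [a, b]: signed sum of F over its 2^d vertices."""
    d = len(a)
    total = 0.0
    for picks in product((0, 1), repeat=d):            # 0 -> a_k, 1 -> b_k
        vertex = [b[k] if p else a[k] for k, p in enumerate(picks)]
        n_lower = d - sum(picks)                        # coordinates equal to a_k
        sign = 1.0 if n_lower % 2 == 0 else -1.0
        total += sign * F(*vertex)
    return total


def explicit_3d(F, x1, x2, y1, y2, z1, z2):
    """The eight-term formula for d = 3, written out term by term."""
    return (F(x2, y2, z2) - F(x2, y2, z1) - F(x2, y1, z2) - F(x1, y2, z2)
            + F(x2, y1, z1) + F(x1, y2, z1) + F(x1, y1, z2) - F(x1, y1, z1))


F = lambda x, y, z: x * y * z                           # illustrative test function
vol = f_volume(F, [0.1, 0.2, 0.3], [0.4, 0.5, 0.9])
print(vol)                                              # close to 0.3 * 0.3 * 0.6
print(explicit_3d(F, 0.1, 0.4, 0.2, 0.5, 0.3, 0.9))
```

For the product function the $F$-volume equals the ordinary volume of the box, which makes it a convenient test case.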

Definition 4.9 $F$ is said to be a $d$-increasing function if for all $d$-boxes $B$ with vertices in $U_1 \times U_2 \times \cdots \times U_d$ holds:

$$V_F(B) \ge 0 \tag{4.133}$$

Definition 4.10 If $U_1, U_2, \ldots, U_d$ have smallest elements $\min U_1, \min U_2, \ldots, \min U_d$ respectively, then we say that a function $F : U_1 \times U_2 \times \cdots \times U_d \to \overline{\mathbb{R}}$ is grounded if:

$$F(x) = 0 \quad \text{for all } x \in U_1 \times U_2 \times \cdots \times U_d \tag{4.134}$$

such that $x_k = \min U_k$ for at least one $k$.

The  lemmas,  which  we  presented  for  the  two-dimensional  case,  have  analogous 
multivariate  versions,  see  Nelsen  (1999). 

Definition 4.11 A $d$-dimensional copula (or $d$-copula) is a function $C$ defined on the unit $d$-cube $I^d = I \times I \times \cdots \times I$ such that

(i) for every $u \in I^d$ holds: $C(u) = 0$, if at least one coordinate of $u$ is equal to 0; i.e. $C$ is grounded.

(ii) for every $a, b \in I^d$ with $a \le b$ holds:

$$V_C([a, b]) \ge 0; \tag{4.135}$$

i.e. $C$ is $d$-increasing.

(iii) for every $u \in I^d$ holds: $C(u) = u_k$, if all coordinates of $u$ are 1 except $u_k$.

Analogously to the two-dimensional setting, let us state Sklar’s theorem for the $d$-dimensional case.

Theorem 4.17 (Sklar’s Theorem in $d$-Dimensional Case) Let $F$ be a $d$-dimensional distribution function with marginal distribution functions $F_{X_1}, F_{X_2}, \ldots, F_{X_d}$. Then a $d$-copula $C$ exists such that for all $x_1, \ldots, x_d \in \overline{\mathbb{R}}$:

$$F(x_1, x_2, \ldots, x_d) = C\{F_{X_1}(x_1), F_{X_2}(x_2), \ldots, F_{X_d}(x_d)\} \tag{4.136}$$

Moreover, if $F_{X_1}, F_{X_2}, \ldots, F_{X_d}$ are continuous then $C$ is unique. Otherwise $C$ is uniquely determined on the Cartesian product $\operatorname{Im}(F_{X_1}) \times \operatorname{Im}(F_{X_2}) \times \cdots \times \operatorname{Im}(F_{X_d})$.

Conversely, if $C$ is a copula and $F_{X_1}, F_{X_2}, \ldots, F_{X_d}$ are distribution functions then $F$ defined by (4.136) is a $d$-dimensional distribution function with marginals $F_{X_1}, F_{X_2}, \ldots, F_{X_d}$.

In order to illustrate the $d$-copulae we present the following examples:

Example 4.31 Let $\Phi$ denote the univariate standard normal distribution function and $\Phi_{\Sigma,d}$ the $d$-dimensional standard normal distribution function with correlation matrix $\Sigma$. Then the function

$$C_\rho^{Gauss}(u, \Sigma) = \Phi_{\Sigma,d}\{\Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_d)\} = \int_{-\infty}^{\Phi^{-1}(u_d)} \cdots \int_{-\infty}^{\Phi^{-1}(u_1)} f_\Sigma(x_1, \ldots, x_d)\, dx_1 \ldots dx_d \tag{4.137}$$
is the $d$-dimensional Gaussian or normal copula with correlation matrix $\Sigma$. The function

$$c_\Sigma(u_1, \ldots, u_d) = \frac{1}{\sqrt{\det(\Sigma)}} \exp\left\{-\frac{\zeta^\top(\Sigma^{-1} - \mathcal{I}_d)\zeta}{2}\right\}, \quad \zeta = \{\Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_d)\}^\top \tag{4.138}$$

is a copula density function. The copula dependence parameter $\alpha$ is the collection of all unknown correlation coefficients in $\Sigma$. If $\alpha \ne 0$, then the corresponding normal copula can generate joint symmetric dependence. However, it is not possible to model a tail dependence, i.e. joint extreme events have a zero probability.

Example 4.32 Let us consider the following function

$$C_\theta^{GH}(u_1, \ldots, u_d) = \exp\left[-\left\{\sum_{j=1}^{d} (-\log u_j)^\theta\right\}^{1/\theta}\right] \tag{4.139}$$

One recognises this function as the $d$-dimensional Gumbel-Hougaard copula function. Unlike the Gaussian copula, the copula (4.139) can generate an upper tail dependence.

Example 4.33 As in the two-dimensional setting, let us consider the $d$-dimensional Gumbel-Hougaard copula for the case $\theta = 1$. In this case the Gumbel-Hougaard copula reduces to the $d$-dimensional product copula, i.e.

$$C_1(u_1, \ldots, u_d) = \prod_{j=1}^{d} u_j = \Pi^d(u) \tag{4.140}$$

The extension of the two-dimensional copula $M$, which one gets from the $d$-dimensional Gumbel-Hougaard copula for $\theta \to \infty$, is denoted $M^d(u)$:

$$C_\theta(u_1, \ldots, u_d) \to \min(u_1, \ldots, u_d) = M^d(u) \tag{4.141}$$

The $d$-dimensional function

$$W^d(u) = \max(u_1 + u_2 + \cdots + u_d - d + 1, 0) \tag{4.142}$$

satisfies $W^d(u) \le C(u)$ for any $d$-dimensional copula function $C(u)$. $W^d(u)$ is the Fréchet-Hoeffding lower bound in the $d$-dimensional case.

The functions $M^d$ and $\Pi^d$ are $d$-copulae for all $d \ge 2$, whereas the function $W^d$ fails to be a $d$-copula for any $d > 2$ (Nelsen, 1999). However, the $d$-dimensional version of the Fréchet-Hoeffding inequality can be written as follows:

$$W^d(u) \le C(u) \le M^d(u) \tag{4.143}$$
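The $d$-dimensional bounds, and the failure of $W^d$ to be a copula for $d > 2$, can be seen in a small numerical example. The sketch below uses hypothetical names (`gumbel_hougaard_d`, `W`, `volume`) and the box $[1/2, 1]^3$ as an illustrative choice: the $W^3$-volume of that box is negative, so $W^3$ is not 3-increasing.

```python
from itertools import product
from math import exp, log


def gumbel_hougaard_d(u, theta):
    """d-dimensional Gumbel-Hougaard copula for 0 < u_j < 1 and theta >= 1."""
    return exp(-(sum((-log(x)) ** theta for x in u) ** (1.0 / theta)))


def W(u):
    """Lower Frechet-Hoeffding bound max(u_1 + ... + u_d - d + 1, 0)."""
    return max(sum(u) - len(u) + 1.0, 0.0)


def volume(F, a, b):
    """F-volume of the box [a, b] as a signed sum over its vertices."""
    d = len(a)
    return sum(
        (-1.0) ** (d - sum(p)) * F([b[k] if p[k] else a[k] for k in range(d)])
        for p in product((0, 1), repeat=d)
    )


u = [0.9, 0.6, 0.8]
print(W(u), gumbel_hougaard_d(u, 3.0), min(u))    # lower bound <= copula <= upper bound
print(gumbel_hougaard_d(u, 1.0))                   # theta = 1: product 0.9 * 0.6 * 0.8
print(volume(W, [0.5] * 3, [1.0] * 3))             # negative volume
```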

As  we  have  already  mentioned,  copula  functions  have  been  widely  applied  in 
empirical  finance. 


Summary

•  The cumulative distribution function (cdf) is defined as $F(x) = P(X \le x)$.

•  If a probability density function (pdf) $f$ exists then $F(x) = \int_{-\infty}^{x} f(u)\,du$.

•  The pdf integrates to one, i.e. $\int_{-\infty}^{\infty} f(x)\,dx = 1$.


4.8  Bootstrap 

Recall that we need large sample sizes in order to sufficiently approximate the critical values computable by the CLT. Here large means $n > 50$ for one-dimensional data. How can we construct confidence intervals in the case of smaller sample sizes? One way is to use a method called the Bootstrap. The Bootstrap algorithm uses the data twice:

1. estimate the parameter of interest,

2. simulate from an estimated distribution to approximate the asymptotic distribution of the statistics of interest.

In detail, bootstrap works as follows. Consider the observations $x_1, \ldots, x_n$ of the sample $X_1, \ldots, X_n$ and estimate the empirical distribution function (EDF) $F_n$. In the case of one-dimensional data

$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(x_i \le x). \tag{4.144}$$


This  is  a  step  function  which  is  constant  between  neighbouring  data  points. 
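The EDF is straightforward to compute. A minimal sketch, with the hypothetical helper name `edf` and made-up data; it counts the observations not exceeding $x$ by binary search in the sorted sample.

```python
from bisect import bisect_right


def edf(sample):
    """Return the empirical distribution function F_n of `sample`."""
    xs = sorted(sample)
    n = len(xs)

    def F_n(x):
        # bisect_right counts the observations x_i with x_i <= x
        return bisect_right(xs, x) / n

    return F_n


F_n = edf([2.1, -0.3, 0.7, 0.7, 1.5])          # illustrative data
print([F_n(x) for x in (-1.0, 0.7, 1.0, 3.0)])  # [0.0, 0.6, 0.6, 1.0]
```

The repeated value 0.7 produces a jump of size $2/n$, and the function is constant between neighbouring data points.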
Fig. 4.19 The standard normal cdf (thick line) and the empirical distribution function (thin line) for $n = 100$ Q MVAedfnormal

Fig. 4.20 The standard normal cdf (thick line) and the empirical distribution function (thin line) for $n = 1{,}000$ Q MVAedfnormal

Example 4.34 Suppose that we have $n = 100$ standard normal $N(0, 1)$ data points $X_i$, $i = 1, \ldots, n$. The cdf of $X$ is $\Phi(x) = \int_{-\infty}^{x} \varphi(u)\,du$ and is shown in Fig. 4.19 as the thin, solid line. The EDF is displayed as a thick step function line. Figure 4.20 shows the same setup for $n = 1{,}000$ observations.

Now draw with replacement a new sample from this empirical distribution. That is, we sample with replacement $n^*$ observations $X_1^*, \ldots, X_{n^*}^*$ from the original sample. This is called a Bootstrap sample. Usually one takes $n^* = n$.

Since we sample with replacement, a single observation from the original sample may appear several times in the Bootstrap sample. For instance, if the original sample consists of the three observations $x_1, x_2, x_3$, then a Bootstrap sample might look like $X_1^* = x_3$, $X_2^* = x_2$, $X_3^* = x_3$. Computationally, we find the Bootstrap sample by using a uniform random number generator to draw from the indices $1, 2, \ldots, n$ of the original samples.
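Drawing indices uniformly is all that is needed to produce a Bootstrap sample. A minimal sketch with the hypothetical helper name `bootstrap_sample`; the seed and the data are arbitrary illustrative choices.

```python
import random


def bootstrap_sample(sample, rng, n_star=None):
    """Draw n_star observations with replacement by sampling indices uniformly."""
    n = len(sample)
    n_star = n if n_star is None else n_star
    return [sample[rng.randrange(n)] for _ in range(n_star)]


rng = random.Random(0)                  # fixed seed for reproducibility
x = [1.2, 0.4, 3.1]
print(bootstrap_sample(x, rng))         # three draws, repetitions are possible
print(bootstrap_sample(x, rng, 5))      # n* need not equal n
```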

The Bootstrap observations are drawn randomly from the empirical distribution, i.e. the probability for each original observation to be selected into the Bootstrap sample is $1/n$ for each draw. It is easy to compute that

$$E_{F_n}(X_i^*) = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}.$$

That is, the expected value of a Bootstrap observation, given that the cdf is $F_n$, is the mean of the original sample $x_1, \ldots, x_n$. The same holds for the variance, i.e.

$$\operatorname{Var}_{F_n}(X_i^*) = \hat\sigma^2,$$

where $\hat\sigma^2 = n^{-1} \sum (x_i - \bar{x})^2$. The cdf of the bootstrap observations is defined as in (4.144). Figure 4.21 shows the cdf of the $n = 100$ original observations as a solid line and two bootstrap cdf’s as thin lines.

The  CLT  holds  for  the  bootstrap  sample.  Analogously  to  Corollary  4.1  we  have 
the  following  corollary. 


Fig. 4.21 The cdf $F_n$ (thick line) and two bootstrap cdf’s $F_n^*$ (thin lines) Q MVAedfbootstrap

Corollary 4.2 If $X_1^*, \ldots, X_n^*$ is a bootstrap sample from $X_1, \ldots, X_n$, then the distribution of

$$\sqrt{n}\left(\frac{\bar{x}^* - \bar{x}}{\hat\sigma^*}\right)$$

also becomes $N(0, 1)$ asymptotically, where $\bar{x}^* = n^{-1} \sum_{i=1}^{n} X_i^*$ and $(\hat\sigma^*)^2 = n^{-1} \sum_{i=1}^{n} (X_i^* - \bar{x}^*)^2$.

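How do we find a confidence interval for $\mu$ using the Bootstrap method? Recall that the quantile $u_{1-\alpha/2}$ might be bad for small sample sizes because the true distribution of $\sqrt{n}\left(\frac{\bar{x} - \mu}{\hat\sigma}\right)$ might be far away from the limit distribution $N(0, 1)$. The Bootstrap idea enables us to “simulate” this distribution by computing $\sqrt{n}\left(\frac{\bar{x}^* - \bar{x}}{\hat\sigma^*}\right)$ for many Bootstrap samples. In this way we can estimate an empirical $(1 - \alpha/2)$-quantile $u^*_{1-\alpha/2}$. The bootstrap improved confidence interval is then

$$C^*_{1-\alpha} = \left[\bar{x} - \frac{\hat\sigma}{\sqrt{n}}\, u^*_{1-\alpha/2},\ \bar{x} + \frac{\hat\sigma}{\sqrt{n}}\, u^*_{1-\alpha/2}\right].$$

By Corollary 4.2 we have

$$P(\mu \in C^*_{1-\alpha}) \to 1 - \alpha \quad \text{as } n \to \infty,$$

but with an improved speed of convergence, see Hall (1992).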
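The whole procedure fits in a short routine. The following is a sketch under stated assumptions, not the book's quantlet: the names `mean_sd` and `bootstrap_ci`, the number of resamples and the data are hypothetical, and the interval is taken to be the symmetric form $\bar{x} \pm \hat\sigma\, u^*_{1-\alpha/2}/\sqrt{n}$ with $u^*_{1-\alpha/2}$ the empirical quantile of the studentised bootstrap statistics.

```python
import random
from math import sqrt


def mean_sd(xs):
    """Sample mean and standard deviation with divisor n, as in the text."""
    n = len(xs)
    m = sum(xs) / n
    return m, sqrt(sum((x - m) ** 2 for x in xs) / n)


def bootstrap_ci(sample, alpha=0.05, n_boot=2000, seed=0):
    """Bootstrap confidence interval for the mean (symmetric form, see lead-in)."""
    rng = random.Random(seed)
    n = len(sample)
    xbar, sigma = mean_sd(sample)
    stats = []
    for _ in range(n_boot):
        star = [sample[rng.randrange(n)] for _ in range(n)]
        m_star, s_star = mean_sd(star)
        if s_star > 0:                           # skip degenerate resamples
            stats.append(sqrt(n) * (m_star - xbar) / s_star)
    stats.sort()
    # empirical (1 - alpha/2)-quantile of the simulated statistics
    u_star = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    half = sigma / sqrt(n) * u_star
    return xbar - half, xbar + half


data = [2.3, 1.9, 3.1, 2.8, 2.2, 3.5, 1.7, 2.9, 2.4, 3.0, 2.6, 2.1]
print(bootstrap_ci(data))
```

With only twelve observations the simulated quantile will typically differ from the normal quantile 1.96, which is exactly the small-sample correction the method is meant to provide.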


Summary

•  For small sample sizes the bootstrap improves the precision of the confidence interval.

•  The bootstrap distribution $\mathcal{L}\{\sqrt{n}(\bar{x}^* - \bar{x})/\hat\sigma^*\}$ converges to the same asymptotic limit as the distribution $\mathcal{L}\{\sqrt{n}(\bar{x} - \mu)/\hat\sigma\}$.

4.9  Exercises 

Exercise 4.1 Assume that the random vector $Y$ has the following normal distribution: $Y \sim N_p(0, \mathcal{I})$. Transform it according to (4.49) to create $X \sim N(\mu, \Sigma)$ with

mean $\mu = (3, 2)^\top$ and $\Sigma = \begin{pmatrix} 1 & -1.5 \\ -1.5 & 4 \end{pmatrix}$. How would you implement the resulting formula on a computer?




Exercise 4.2 Prove Theorem 4.7 using Theorem 4.5.

Exercise 4.3 Suppose that $X$ has mean zero and covariance $\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}$. Let $Y = X_1 + X_2$. Write $Y$ as a linear transformation, i.e. find the transformation matrix $\mathcal{A}$. Then compute $\operatorname{Var}(Y)$ via (4.26). Can you obtain the result in another fashion?

Exercise 4.4 Calculate the mean and the variance of the estimate $\hat\beta$ in (3.50).

Exercise 4.5 Compute the conditional moments $E(X_2 \mid x_1)$ and $E(X_1 \mid x_2)$ for the pdf of Example 4.5.

Exercise  4.6  Prove  the  relation  (4.28). 

Exercise  4.7  Prove  the  relation  (4.29). 

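Hint: Note that $\operatorname{Var}\{E(X_2 \mid X_1)\} = E\{E(X_2 \mid X_1)\, E(X_2^\top \mid X_1)\} - E(X_2)\, E(X_2^\top)$ and that

$E\{\operatorname{Var}(X_2 \mid X_1)\} = E\{E(X_2 X_2^\top \mid X_1) - E(X_2 \mid X_1)\, E(X_2^\top \mid X_1)\}$.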

Exercise  4.8  Compute  (4.46)  for  the  pdf  of  Example  4.5. 

Exercise 4.9 Show that

$$f_Y(y) = \begin{cases} \frac{1}{2} y_1 - \frac{1}{4} y_2 & 0 \le y_1 \le 2,\ |y_2| \le 1 - |1 - y_1| \\ 0 & \text{otherwise} \end{cases}$$

is a pdf.


Exercise  4.10  Compute  (4.46)  for  a  two-dimensional  standard  normal  distribution. 
Show  that  the  transformed  random  variables  Y\  and  Y2  are  independent.  Give  a 
geometrical  interpretation  of  this  result  based  on  iso-distance  curves. 

Exercise 4.11 Consider the Cauchy distribution which has no finite moment, so that the CLT cannot be applied. Simulate the distribution of $\bar{x}$ (for different $n$’s). What can you expect for $n \to \infty$?

Hint:  The  Cauchy  distribution  can  be  simulated  by  the  quotient  of  two  indepen¬ 
dent  standard  normally  distributed  random  variables. 

Exercise 4.12 A European car company has tested a new model and reports the consumption of petrol ($X_1$) and oil ($X_2$). The expected consumption of petrol is 8 l per 100 km ($\mu_1$) and the expected consumption of oil is 1 l per 10,000 km ($\mu_2$). The measured consumption of petrol is 8.1 l per 100 km ($\bar{x}_1$) and the measured consumption of oil is 1.1 l per 10,000 km ($\bar{x}_2$). The asymptotic distribution of

>{©-©}»«(o,  (-»„»)). 

For the American market the basic measuring units are miles (1 mile $\approx$ 1.6 km) and gallons (1 gallon $\approx$ 3.8 l). The consumptions of petrol ($Y_1$) and oil ($Y_2$) are usually reported in miles per gallon. Can you express $\bar{y}_1$ and $\bar{y}_2$ in terms of $\bar{x}_1$ and $\bar{x}_2$? Recompute the asymptotic distribution for the American market.

Exercise 4.13 Consider the pdf $f(x_1, x_2) = e^{-(x_1 + x_2)}$, $x_1, x_2 > 0$ and let $U_1 = X_1 + X_2$ and $U_2 = X_1 - X_2$. Compute $f(u_1, u_2)$.

Exercise 4.14 Consider the pdf’s

$$f(x_1, x_2) = 4 x_1 x_2 e^{-x_1^2}, \quad x_1, x_2 > 0,$$

$$f(x_1, x_2) = 1, \quad 0 < x_1, x_2 < 1 \text{ and } x_1 + x_2 < 1,$$

$$f(x_1, x_2) = \tfrac{1}{2} e^{-x_1}, \quad x_1 > |x_2|.$$

For each of these pdf’s compute $E(X)$, $\operatorname{Var}(X)$, $E(X_1 \mid X_2)$, $E(X_2 \mid X_1)$, $\operatorname{Var}(X_1 \mid X_2)$ and $\operatorname{Var}(X_2 \mid X_1)$.

Exercise 4.15 Consider the pdf $f(x_1, x_2) = 2$, $0 < x_1 < x_2 < 1$. Compute $P(X_1 < 0.25)$, $P(X_2 < 0.25)$ and $P(X_2 < 0.25 \mid X_1 < 0.25)$.

Exercise 4.16 Consider the pdf $f(x_1, x_2) = \frac{1}{2\pi}$, $0 < x_1 < 2\pi$, $0 < x_2 < 1$. Let $U_1 = \sin X_1 \sqrt{-2 \log X_2}$ and $U_2 = \cos X_1 \sqrt{-2 \log X_2}$. Compute $f(u_1, u_2)$.

Exercise 4.17 Consider $f(x_1, x_2, x_3) = k(x_1 + x_2 x_3)$, $0 < x_1, x_2, x_3 < 1$.

(a) Determine $k$ so that $f$ is a valid pdf of $(X_1, X_2, X_3) = X$.

(b) Compute the $(3 \times 3)$ matrix $\Sigma_X$.

(c) Compute the $(2 \times 2)$ matrix of the conditional variance of $(X_2, X_3)$ given $X_1 = x_1$.


(a)  Represent  the  contour  ellipses  for  a  =0;  — E 

(b)  For  a  =  ^  find  the  regions  of  X  centred  on  p  which  cover  the  area  of  the  true 
parameter  with  probability  0.90  and  0.95. 


Exercise 4.19 Consider the pdf

$$f(x_1, x_2) = \frac{1}{8 x_2} \exp\left\{-\left(\frac{x_1}{2 x_2} + \frac{x_2}{4}\right)\right\}, \quad x_1, x_2 > 0.$$

Compute $f(x_2)$ and $f(x_1 \mid x_2)$. Also give the best approximation of $X_1$ by a function of $X_2$. Compute the variance of the error of the approximation.

Exercise  4.20  Prove  Theorem  4.6. 


Chapter  5 

Theory  of  the  Multinormal 


In  the  preceding  chapter  we  saw  how  the  multivariate  normal  distribution  comes  into 
play  in  many  applications.  It  is  useful  to  know  more  about  this  distribution,  since 
it  is  often  a  good  approximate  distribution  in  many  situations.  Another  reason  for 
considering  the  multinormal  distribution  relies  on  the  fact  that  it  has  many  appealing 
properties:  it  is  stable  under  linear  transforms,  zero  correlation  corresponds  to 
independence,  the  marginals  and  all  the  conditionals  are  also  multivariate  normal 
variates,  etc.  The  mathematical  properties  of  the  multinormal  make  analyses  much 
simpler. 

In  this  chapter  we  will  first  concentrate  on  the  probabilistic  properties  of 
the  multinormal,  then  we  will  introduce  two  “companion”  distributions  of  the 
multinormal  which  naturally  appear  when  sampling  from  a  multivariate  normal 
population:  the  Wishart  and  the  Hotelling  distributions.  The  latter  is  particularly 
important  for  most  of  the  testing  procedures  proposed  in  Chap.  7. 


5.1  Elementary  Properties  of  the  Multinormal 

Let  us  first  summarise  some  properties  which  were  already  derived  in  the  previous 
chapter. 

•  The pdf of $X \sim N_p(\mu, \Sigma)$ is

$$f(x) = |2\pi\Sigma|^{-1/2} \exp\left\{-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right\}. \tag{5.1}$$

The expectation is $E(X) = \mu$, the covariance can be calculated as $\operatorname{Var}(X) = E(X - \mu)(X - \mu)^\top = \Sigma$.

•  Linear transformations turn normal random variables into normal random variables. If $X \sim N_p(\mu, \Sigma)$ and $\mathcal{A}(p \times p)$, $c \in \mathbb{R}^p$, then $Y = \mathcal{A}X + c$ is $p$-variate Normal, i.e.

$$Y \sim N_p(\mathcal{A}\mu + c, \mathcal{A}\Sigma\mathcal{A}^\top). \tag{5.2}$$

•  If $X \sim N_p(\mu, \Sigma)$, then the Mahalanobis transformation is

$$Y = \Sigma^{-1/2}(X - \mu) \sim N_p(0, \mathcal{I}_p) \tag{5.3}$$

and it holds that

$$Y^\top Y = (X - \mu)^\top \Sigma^{-1} (X - \mu) \sim \chi^2_p. \tag{5.4}$$

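Often it is interesting to partition $X$ into sub-vectors $X_1$ and $X_2$. The following theorem tells us how to correct $X_2$ to obtain a vector which is independent of $X_1$.

Theorem 5.1 Let $X = \binom{X_1}{X_2} \sim N_p(\mu, \Sigma)$, $X_1 \in \mathbb{R}^r$, $X_2 \in \mathbb{R}^{p-r}$. Define $X_{2.1} = X_2 - \Sigma_{21}\Sigma_{11}^{-1} X_1$ from the partitioned covariance matrix

$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$$

Then

$$X_1 \sim N_r(\mu_1, \Sigma_{11}), \tag{5.5}$$

$$X_{2.1} \sim N_{p-r}(\mu_{2.1}, \Sigma_{22.1}) \tag{5.6}$$

are independent with

$$\mu_{2.1} = \mu_2 - \Sigma_{21}\Sigma_{11}^{-1}\mu_1, \quad \Sigma_{22.1} = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}. \tag{5.7}$$

Proof

$$X_1 = \mathcal{A}X \quad \text{with } \mathcal{A} = (\mathcal{I}_r, 0)$$

$$X_{2.1} = \mathcal{B}X \quad \text{with } \mathcal{B} = (-\Sigma_{21}\Sigma_{11}^{-1}, \mathcal{I}_{p-r}).$$

Then, by (5.2) $X_1$ and $X_{2.1}$ are both normal. Note that

$$\operatorname{Cov}(X_1, X_{2.1}) = \mathcal{A}\Sigma\mathcal{B}^\top = (\mathcal{I}_r \ \ 0) \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \begin{pmatrix} (-\Sigma_{21}\Sigma_{11}^{-1})^\top \\ \mathcal{I}_{p-r} \end{pmatrix},$$

$$\mathcal{A}\Sigma = (\mathcal{I}_r \ \ 0) \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} = (\Sigma_{11} \ \ \Sigma_{12}),$$

hence,

$$\mathcal{A}\Sigma\mathcal{B}^\top = (\Sigma_{11} \ \ \Sigma_{12}) \begin{pmatrix} (-\Sigma_{21}\Sigma_{11}^{-1})^\top \\ \mathcal{I}_{p-r} \end{pmatrix} = -\Sigma_{11}(\Sigma_{21}\Sigma_{11}^{-1})^\top + \Sigma_{12}.$$

Recall that $\Sigma_{21} = (\Sigma_{12})^\top$. Hence $\mathcal{A}\Sigma\mathcal{B}^\top = -\Sigma_{11}\Sigma_{11}^{-1}\Sigma_{12} + \Sigma_{12} = 0$.

Using (5.2) again we also have the joint distribution of $(X_1, X_{2.1})$, namely

$$\begin{pmatrix} X_1 \\ X_{2.1} \end{pmatrix} = \begin{pmatrix} \mathcal{A} \\ \mathcal{B} \end{pmatrix} X \sim N_p\left(\begin{pmatrix} \mu_1 \\ \mu_{2.1} \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22.1} \end{pmatrix}\right).$$

With this block diagonal structure of the covariance matrix, the joint pdf of $(X_1, X_{2.1})$ can easily be factorised into

$$f(x_1, x_{2.1}) = |2\pi\Sigma_{11}|^{-1/2} \exp\left\{-\frac{1}{2}(x_1 - \mu_1)^\top \Sigma_{11}^{-1} (x_1 - \mu_1)\right\} \times |2\pi\Sigma_{22.1}|^{-1/2} \exp\left\{-\frac{1}{2}(x_{2.1} - \mu_{2.1})^\top \Sigma_{22.1}^{-1} (x_{2.1} - \mu_{2.1})\right\}$$

from which the independence between $X_1$ and $X_{2.1}$ follows. □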


The next two corollaries are direct consequences of Theorem 5.1.

Corollary 5.1 Let $X = \binom{X_1}{X_2} \sim N_p(\mu, \Sigma)$, $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$. $\Sigma_{12} = 0$ if and only if $X_1$ is independent of $X_2$.

The independence of two linear transforms of a multinormal $X$ can be shown via the following corollary.

Corollary 5.2 If $X \sim N_p(\mu, \Sigma)$ and given some matrices $\mathcal{A}$ and $\mathcal{B}$, then $\mathcal{A}X$ and $\mathcal{B}X$ are independent if and only if $\mathcal{A}\Sigma\mathcal{B}^\top = 0$.

The following theorem is also useful. It generalises Theorem 4.6. The proof is left as an exercise.

Theorem 5.2 If $X \sim N_p(\mu, \Sigma)$, $\mathcal{A}(q \times p)$, $c \in \mathbb{R}^q$ and $q \le p$, then $Y = \mathcal{A}X + c$ is a $q$-variate Normal, i.e.

$$Y \sim N_q(\mathcal{A}\mu + c, \mathcal{A}\Sigma\mathcal{A}^\top).$$

The conditional distribution of $X_2$ given $X_1$ is given by the next theorem.

Theorem 5.3 The conditional distribution of $X_2$ given $X_1 = x_1$ is normal with mean $\mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(x_1 - \mu_1)$ and covariance $\Sigma_{22.1}$, i.e.

$$(X_2 \mid X_1 = x_1) \sim N_{p-r}\{\mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(x_1 - \mu_1), \Sigma_{22.1}\}. \tag{5.8}$$

Proof Since $X_2 = X_{2.1} + \Sigma_{21}\Sigma_{11}^{-1} X_1$, for a fixed value of $X_1 = x_1$, $X_2$ is equivalent to $X_{2.1}$ plus a constant term:

$$(X_2 \mid X_1 = x_1) = (X_{2.1} + \Sigma_{21}\Sigma_{11}^{-1} x_1),$$

which has the normal distribution $N(\mu_{2.1} + \Sigma_{21}\Sigma_{11}^{-1} x_1, \Sigma_{22.1})$. □

Note  that  the  conditional  mean  of  (X2  |  X\)  is  a  linear  function  of  X\  and  that  the 
conditional  variance  does  not  depend  on  the  particular  value  of  X\ .  In  the  following 
example  we  consider  a  specific  distribution. 


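Example 5.1 Suppose that $p = 2$, $r = 1$, $\mu = \binom{0}{0}$ and $\Sigma = \begin{pmatrix} 1 & -0.8 \\ -0.8 & 2 \end{pmatrix}$. Then $\Sigma_{11} = 1$, $\Sigma_{21} = -0.8$ and $\Sigma_{22.1} = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} = 2 - (0.8)^2 = 1.36$. Hence the marginal pdf of $X_1$ is

$$f_{X_1}(x_1) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x_1^2}{2}\right)$$

and the conditional pdf of $(X_2 \mid X_1 = x_1)$ is given by

$$f(x_2 \mid x_1) = \frac{1}{\sqrt{2\pi(1.36)}} \exp\left\{-\frac{(x_2 + 0.8 x_1)^2}{2 \times (1.36)}\right\}.$$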


As  mentioned  above,  the  conditional  mean  of  (X2  |  X\)  is  linear  in  X\.  The  shift  in 
the  density  of  (X2  |  X\)  can  be  seen  in  Fig.  5.1. 
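The conditional mean and variance of this example follow directly from the formulas for the conditional distribution. A minimal sketch, with hypothetical function names and assuming the mean vector $\mu = (0, 0)^\top$; it shows how the conditional mean shifts linearly with $x_1$ while the conditional variance stays fixed.

```python
from math import exp, pi, sqrt

# Parameters of the example (p = 2, r = 1); mu = (0, 0) is assumed here.
mu1, mu2 = 0.0, 0.0
s11, s12, s21, s22 = 1.0, -0.8, -0.8, 2.0


def conditional_params(x1):
    """Mean and variance of (X2 | X1 = x1) for the bivariate normal."""
    mean = mu2 + s21 / s11 * (x1 - mu1)
    var = s22 - s21 / s11 * s12             # Sigma_22.1, does not depend on x1
    return mean, var


def conditional_pdf(x2, x1):
    """Density of (X2 | X1 = x1) evaluated at x2."""
    mean, var = conditional_params(x1)
    return exp(-(x2 - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)


for x1 in (-1.0, 0.0, 1.0):
    print(x1, conditional_params(x1))       # mean -0.8 * x1, variance about 1.36
```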

Sometimes  it  will  be  useful  to  reconstruct  a  joint  distribution  from  the  marginal 
distribution  of  X\  and  the  conditional  distribution  (X2\X\).  The  following  theorem 
shows  under  which  conditions  this  can  be  easily  done  in  the  multinormal  framework. 


Fig. 5.1 Shifts in the conditional density Q MVAcondnorm


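Theorem 5.4 If $X_1 \sim N_r(\mu_1, \Sigma_{11})$ and $(X_2 \mid X_1 = x_1) \sim N_{p-r}(\mathcal{A}x_1 + b, \Omega)$ where $\Omega$ does not depend on $x_1$, then $X = \binom{X_1}{X_2} \sim N_p(\mu, \Sigma)$, where

$$\mu = \begin{pmatrix} \mu_1 \\ \mathcal{A}\mu_1 + b \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{11}\mathcal{A}^\top \\ \mathcal{A}\Sigma_{11} & \Omega + \mathcal{A}\Sigma_{11}\mathcal{A}^\top \end{pmatrix}.$$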

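Example 5.2 Consider the following random variables

$$X_1 \sim N_1(0, 1),$$

$$X_2 \mid X_1 = x_1 \sim N_2\left(\begin{pmatrix} 2x_1 \\ x_1 + 1 \end{pmatrix}, \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\right).$$

Using Theorem 5.4, where $\mathcal{A} = (2\ 1)^\top$, $b = (0\ 1)^\top$ and $\Omega = \mathcal{I}_2$, we easily obtain the following result:

$$X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N_3\left(\begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 & 2 & 1 \\ 2 & 5 & 2 \\ 1 & 2 & 2 \end{pmatrix}\right).$$

In particular, the marginal distribution of $X_2$ is

$$X_2 \sim N_2\left(\begin{pmatrix} 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 5 & 2 \\ 2 & 2 \end{pmatrix}\right),$$

thus conditional on $X_1$, the two components of $X_2$ are independent but marginally they are not.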

Note  that  the  marginal  mean  vector  and  covariance  matrix  of  X2  could  have 
also  been  computed  directly  by  using  (4.28)-(4.29).  Using  the  derivation  above, 
however,  provides  us  with  useful  properties:  we  have  multinormality. 
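The construction in Theorem 5.4 is mechanical enough to be written as a short routine. The sketch below uses plain Python lists as matrices; the helper names (`matmul`, `transpose`, `joint_from_conditional`) are illustrative and not taken from the book. It is applied to the inputs of Example 5.2.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def joint_from_conditional(mu1, sigma11, A, b, omega):
    """Joint mean and covariance of (X1, X2) following Theorem 5.4."""
    mu1_col = [[m] for m in mu1]
    mu2 = [row[0] + bi for row, bi in zip(matmul(A, mu1_col), b)]   # A mu_1 + b
    a_s11 = matmul(A, sigma11)                                      # A Sigma_11
    s11_at = transpose(a_s11)                                       # Sigma_11 A^T (Sigma_11 is symmetric)
    a_s11_at = matmul(a_s11, transpose(A))                          # A Sigma_11 A^T
    s22 = [[o + v for o, v in zip(orow, vrow)]
           for orow, vrow in zip(omega, a_s11_at)]                  # Omega + A Sigma_11 A^T
    top = [r1 + r2 for r1, r2 in zip(sigma11, s11_at)]
    bottom = [r1 + r2 for r1, r2 in zip(a_s11, s22)]
    return list(mu1) + mu2, top + bottom

# Inputs of Example 5.2: X1 ~ N(0, 1), A = (2 1)^T, b = (0 1)^T, Omega = I_2
mu, sigma = joint_from_conditional(
    mu1=[0.0], sigma11=[[1.0]],
    A=[[2.0], [1.0]], b=[0.0, 1.0],
    omega=[[1.0, 0.0], [0.0, 1.0]],
)
print(mu)
print(sigma)
```

The lower right $2 \times 2$ block of `sigma` is the marginal covariance of $X_2$; its non-zero off-diagonal entry reflects the marginal dependence noted in the example.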


Conditional  Approximations 

As we saw in Chap. 4 (Theorem 4.3), the conditional expectation $E(X_2 \mid X_1)$ is the mean squared error (MSE) best approximation of $X_2$ by a function of $X_1$. We have in this case

$$X_2 = E(X_2 \mid X_1) + U = \mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(X_1 - \mu_1) + U. \tag{5.9}$$

Hence, the best approximation of $X_2 \in \mathbb{R}^{p-r}$ by $X_1 \in \mathbb{R}^r$ is the linear approximation that can be written as:

$$X_2 = \beta_0 + \mathcal{B}X_1 + U \tag{5.10}$$

with $\mathcal{B} = \Sigma_{21}\Sigma_{11}^{-1}$, $\beta_0 = \mu_2 - \mathcal{B}\mu_1$ and $U \sim N(0, \Sigma_{22.1})$.

Consider now the particular case where $r = p - 1$. Now $X_2 \in \mathbb{R}$ and $\mathcal{B}$ is a row vector $\beta^\top$ of dimension $(1 \times r)$:

$$X_2 = \beta_0 + \beta^\top X_1 + U. \tag{5.11}$$

This means, geometrically speaking, that the best MSE approximation of $X_2$ by a function of $X_1$ is a hyperplane. The marginal variance of $X_2$ can be decomposed via (5.11):

$$\sigma_{22} = \beta^\top\Sigma_{11}\beta + \sigma_{22.1} = \sigma_{21}\Sigma_{11}^{-1}\sigma_{12} + \sigma_{22.1}. \tag{5.12}$$

The ratio

$$\rho^2_{2.1\ldots r} = \frac{\sigma_{21}\Sigma_{11}^{-1}\sigma_{12}}{\sigma_{22}} \tag{5.13}$$

is known as the square of the multiple correlation between $X_2$ and the $r$ variables $X_1$. It is the percentage of the variance of $X_2$ which is explained by the linear approximation $\beta_0 + \beta^\top X_1$. The last term in (5.12) is the residual variance of $X_2$. The square of the multiple correlation corresponds to the coefficient of determination introduced in Sect. 3.4, see (3.39), but here it is defined in terms of the r.v. $X_1$ and $X_2$. It can be shown that $\rho_{2.1\ldots r}$ is also the maximum correlation attainable between $X_2$ and a linear combination of the elements of $X_1$, the optimal linear combination being precisely given by $\beta^\top X_1$. Note that when $r = 1$, the multiple correlation $\rho_{2.1}$ coincides with the usual simple correlation $\rho_{X_2X_1}$ between $X_2$ and $X_1$.

Example 5.3 Consider the “classic blue” pullover example (Example 3.15) and suppose that $X_1$ (sales), $X_2$ (price), $X_3$ (advertisement) and $X_4$ (sales assistants) are normally distributed with

$$\mu = \begin{pmatrix} 172.7 \\ 104.6 \\ 104.0 \\ 93.8 \end{pmatrix} \quad \text{and} \quad \Sigma = \begin{pmatrix} 1037.21 & & & \\ -80.02 & 219.84 & & \\ 1430.70 & 92.10 & 2624.00 & \\ 271.44 & -91.58 & 210.30 & 177.36 \end{pmatrix}.$$

(These are in fact the sample mean and the sample covariance matrix but in this example we pretend that they are the true parameter values.)

The conditional distribution of $X_1$ given $(X_2, X_3, X_4)$ is thus a univariate normal with mean

$$\mu_1 + \sigma_{12}\Sigma_{22}^{-1}\begin{pmatrix} X_2 - \mu_2 \\ X_3 - \mu_3 \\ X_4 - \mu_4 \end{pmatrix} = 65.670 - 0.216X_2 + 0.485X_3 + 0.844X_4$$

and variance

$$\sigma_{11.2} = \sigma_{11} - \sigma_{12}\Sigma_{22}^{-1}\sigma_{21} = 96.761.$$

The linear approximation of the sales ($X_1$) by the price ($X_2$), advertisement ($X_3$) and sales assistants ($X_4$) is provided by the conditional mean above. (Note that this coincides with the results of Example 3.15 due to the particular choice of $\mu$ and $\Sigma$.) The quality of the approximation is given by the multiple correlation $\rho^2_{1.234} = \frac{\sigma_{12}\Sigma_{22}^{-1}\sigma_{21}}{\sigma_{11}} = 0.907$. (Note again that this coincides with the coefficient of determination $r^2$ found in Example 3.15.)

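This example also illustrates the concept of partial correlation. The correlation matrix between the four variables is given by

$$P = \begin{pmatrix} 1 & -0.168 & 0.867 & 0.633 \\ -0.168 & 1 & 0.121 & -0.464 \\ 0.867 & 0.121 & 1 & 0.308 \\ 0.633 & -0.464 & 0.308 & 1 \end{pmatrix},$$

so that the correlation between $X_1$ (sales) and $X_2$ (price) is $-0.168$. We can compute the conditional distribution of $(X_1, X_2)$ given $(X_3, X_4)$, which is a bivariate normal with mean:

$$\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} + \begin{pmatrix} \sigma_{13} & \sigma_{14} \\ \sigma_{23} & \sigma_{24} \end{pmatrix}\begin{pmatrix} \sigma_{33} & \sigma_{34} \\ \sigma_{43} & \sigma_{44} \end{pmatrix}^{-1}\begin{pmatrix} X_3 - \mu_3 \\ X_4 - \mu_4 \end{pmatrix} = \begin{pmatrix} 32.516 + 0.467X_3 + 0.977X_4 \\ 153.644 + 0.085X_3 - 0.617X_4 \end{pmatrix}$$

and covariance matrix:

$$\begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix} - \begin{pmatrix} \sigma_{13} & \sigma_{14} \\ \sigma_{23} & \sigma_{24} \end{pmatrix}\begin{pmatrix} \sigma_{33} & \sigma_{34} \\ \sigma_{43} & \sigma_{44} \end{pmatrix}^{-1}\begin{pmatrix} \sigma_{31} & \sigma_{32} \\ \sigma_{41} & \sigma_{42} \end{pmatrix} = \begin{pmatrix} 104.006 & \\ -33.574 & 155.592 \end{pmatrix}.$$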

In particular, the last covariance matrix allows the partial correlation between $X_1$ and $X_2$ to be computed for a fixed level of $X_3$ and $X_4$:

$$\rho_{X_1X_2 \mid X_3X_4} = \frac{-33.574}{\sqrt{104.006 \cdot 155.592}} = -0.264,$$

so that in this particular example with a fixed level of advertisement and sales assistants, the negative correlation between price and sales is stronger than the marginal one.
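The partial correlation can be reproduced from the covariance matrix of Example 5.3 with a few lines of code. This is a standard-library Python sketch rather than the book's quantlet; the helpers `inv2`, `matmul` and `block` are illustrative names, and only a $2 \times 2$ inverse is needed because the conditioning set has two variables.

```python
import math

# Covariance matrix of (sales, price, advertisement, sales assistants), Example 5.3
S = [
    [1037.21, -80.02, 1430.70, 271.44],
    [-80.02, 219.84, 92.10, -91.58],
    [1430.70, 92.10, 2624.00, 210.30],
    [271.44, -91.58, 210.30, 177.36],
]

def inv2(m):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def block(m, rows, cols):
    """Sub-matrix of m with the given row and column indices."""
    return [[m[i][j] for j in cols] for i in rows]

first, given = [0, 1], [2, 3]          # (X1, X2) conditioned on (X3, X4)
s_aa = block(S, first, first)
s_ab = block(S, first, given)
s_bb = block(S, given, given)
s_ba = block(S, given, first)

coef = matmul(s_ab, inv2(s_bb))        # regression coefficients on (X3, X4)
reduction = matmul(coef, s_ba)
cond_cov = [[s_aa[i][j] - reduction[i][j] for j in range(2)] for i in range(2)]

partial_corr = cond_cov[0][1] / math.sqrt(cond_cov[0][0] * cond_cov[1][1])
print(coef)
print(cond_cov)
print(round(partial_corr, 3))
```

The rows of `coef` correspond to the slopes in the conditional mean, and `cond_cov` to the conditional covariance matrix reported in the example, up to rounding.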

Q MVAbluepullover

Summary

↪ If $X \sim N_p(\mu, \Sigma)$, then a linear transformation $\mathcal{A}X + c$, $\mathcal{A}(q \times p)$, where $c \in \mathbb{R}^q$, has distribution $N_q(\mathcal{A}\mu + c, \mathcal{A}\Sigma\mathcal{A}^\top)$.

↪ Two linear transformations $\mathcal{A}X$ and $\mathcal{B}X$ with $X \sim N_p(\mu, \Sigma)$ are independent if and only if $\mathcal{A}\Sigma\mathcal{B}^\top = 0$.

↪ If $X_1$ and $X_2$ are partitions of $X \sim N_p(\mu, \Sigma)$, then the conditional distribution of $X_2$ given $X_1 = x_1$ is again normal.

↪ In the multivariate normal case, $X_1$ is independent of $X_2$ if and only if $\Sigma_{12} = 0$.

↪ The conditional expectation of $(X_2 \mid X_1)$ is a linear function if $\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N_p(\mu, \Sigma)$.

↪ The multiple correlation coefficient is defined as $\rho^2_{2.1\ldots r} = \frac{\sigma_{21}\Sigma_{11}^{-1}\sigma_{12}}{\sigma_{22}}$.

↪ The multiple correlation coefficient is the percentage of the variance of $X_2$ explained by the linear approximation $\beta_0 + \beta^\top X_1$.

5.2 The Wishart Distribution


The Wishart distribution (named after its discoverer) plays a prominent role in the analysis of estimated covariance matrices. If the mean of $X \sim N_p(\mu, \Sigma)$ is known to be $\mu = 0$, then for a data matrix $\mathcal{X}(n \times p)$ the estimated covariance matrix is proportional to $\mathcal{X}^\top\mathcal{X}$. This is the point where the Wishart distribution comes in, because $\mathcal{M}(p \times p) = \mathcal{X}^\top\mathcal{X} = \sum_{i=1}^n x_ix_i^\top$ has a Wishart distribution $W_p(\Sigma, n)$.

Example 5.4 Set $p = 1$, then for $X \sim N_1(0, \sigma^2)$ the data matrix of the observations

$$\mathcal{X} = (x_1, \ldots, x_n)^\top \quad \text{with} \quad \mathcal{M} = \mathcal{X}^\top\mathcal{X} = \sum_{i=1}^n x_ix_i$$

leads to the Wishart distribution $W_1(\sigma^2, n) = \sigma^2\chi^2_n$. The one-dimensional Wishart distribution is thus in fact a $\chi^2$ distribution.
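The one-dimensional case lends itself to a quick simulation check. The sketch below is an illustration under assumed settings ($n = 5$, $\sigma = 2$, 2000 replications, a fixed seed), not a result from the book: it draws $\mathcal{M} = \sum_i x_i^2$ repeatedly and compares the average with $n\sigma^2$, the mean of $\sigma^2\chi^2_n$.

```python
import random

def wishart_1d(n, sigma, rng):
    """One draw of M = sum of n squared N(0, sigma^2) observations, i.e. W_1(sigma^2, n)."""
    return sum(rng.gauss(0.0, sigma) ** 2 for _ in range(n))

rng = random.Random(0)            # fixed seed so the run is reproducible
n, sigma, reps = 5, 2.0, 2000     # illustrative settings
draws = [wishart_1d(n, sigma, rng) for _ in range(reps)]
mean_m = sum(draws) / reps

# The average should be close to n * sigma^2 = 20, the mean of sigma^2 * chi^2_n.
print(mean_m, n * sigma ** 2)
```

With more replications the simulated mean settles closer to $n\sigma^2$, in line with the property $E(\mathcal{M}) = n\Sigma$ for Wishart matrices.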

When we talk about the distribution of a matrix, we mean of course the joint distribution of all its elements. More exactly: since $\mathcal{M} = \mathcal{X}^\top\mathcal{X}$ is symmetric we only need to consider the elements of the lower triangular matrix

$$\mathcal{M} = \begin{pmatrix} m_{11} & & & \\ m_{21} & m_{22} & & \\ \vdots & \vdots & \ddots & \\ m_{p1} & m_{p2} & \ldots & m_{pp} \end{pmatrix}. \tag{5.14}$$

Hence the Wishart distribution is defined by the distribution of the vector

$$(m_{11}, \ldots, m_{p1}, m_{22}, \ldots, m_{p2}, \ldots, m_{pp})^\top. \tag{5.15}$$

Linear transformations of the data matrix $\mathcal{X}$ also lead to Wishart matrices.


Theorem 5.5 If $\mathcal{M} \sim W_p(\Sigma, n)$ and $\mathcal{B}(p \times q)$, then the distribution of $\mathcal{B}^\top\mathcal{M}\mathcal{B}$ is Wishart $W_q(\mathcal{B}^\top\Sigma\mathcal{B}, n)$.

With this theorem we can standardise Wishart matrices since with $\mathcal{B} = \Sigma^{-1/2}$ the distribution of $\Sigma^{-1/2}\mathcal{M}\Sigma^{-1/2}$ is $W_p(\mathcal{I}, n)$. Another connection to the $\chi^2$-distribution is given by the following theorem.

Theorem 5.6 If $\mathcal{M} \sim W_p(\Sigma, m)$, and $a \in \mathbb{R}^p$ with $a^\top\Sigma a \neq 0$, then the distribution of $\frac{a^\top\mathcal{M}a}{a^\top\Sigma a}$ is $\chi^2_m$.

This theorem is an immediate consequence of Theorem 5.5 if we apply the linear transformation $x \mapsto a^\top x$. Central to the analysis of covariance matrices is the next theorem.




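Theorem 5.7 (Cochran) Let $\mathcal{X}(n \times p)$ be a data matrix from a $N_p(0, \Sigma)$ distribution and let $\mathcal{C}(n \times n)$ be a symmetric matrix.

(a) $\mathcal{X}^\top\mathcal{C}\mathcal{X}$ has the distribution of weighted Wishart random variables, i.e.

$$\mathcal{X}^\top\mathcal{C}\mathcal{X} = \sum_{i=1}^n \lambda_i W_p(\Sigma, 1),$$

where $\lambda_i$, $i = 1, \ldots, n$, are the eigenvalues of $\mathcal{C}$.

(b) $\mathcal{X}^\top\mathcal{C}\mathcal{X}$ is Wishart if and only if $\mathcal{C}^2 = \mathcal{C}$. In this case

$$\mathcal{X}^\top\mathcal{C}\mathcal{X} \sim W_p(\Sigma, r),$$

and $r = \operatorname{rank}(\mathcal{C}) = \operatorname{tr}(\mathcal{C})$.

(c) $n\mathcal{S} = \mathcal{X}^\top\mathcal{H}\mathcal{X}$ is distributed as $W_p(\Sigma, n - 1)$ (note that $\mathcal{S}$ is the sample covariance matrix).

(d) $\bar{x}$ and $\mathcal{S}$ are independent.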

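The following properties are useful:

1. If $\mathcal{M} \sim W_p(\Sigma, n)$, then $E(\mathcal{M}) = n\Sigma$.

2. If $\mathcal{M}_i$ are independent Wishart $W_p(\Sigma, n_i)$, $i = 1, \ldots, k$, then $\mathcal{M} = \sum_{i=1}^k \mathcal{M}_i \sim W_p(\Sigma, n)$ where $n = \sum_{i=1}^k n_i$.

3. The density of $W_p(\Sigma, n - 1)$ for a positive definite $\mathcal{M}$ is given by:

$$f_{\Sigma, n-1}(\mathcal{M}) = \frac{|\mathcal{M}|^{\frac{1}{2}(n-p-2)} \exp\left\{-\frac{1}{2}\operatorname{tr}(\mathcal{M}\Sigma^{-1})\right\}}{2^{\frac{1}{2}p(n-1)}\, \pi^{\frac{1}{4}p(p-1)}\, |\Sigma|^{\frac{1}{2}(n-1)} \prod_{i=1}^p \Gamma\left\{\frac{n-i}{2}\right\}}, \tag{5.16}$$

where $\Gamma$ is the gamma function: $\Gamma(z) = \int_0^\infty t^{z-1}e^{-t}\,dt$.

For further details on the Wishart distribution, see Mardia, Kent, and Bibby (1979).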


Summary

↪ The Wishart distribution is a generalisation of the $\chi^2$-distribution. In particular $W_1(\sigma^2, n) = \sigma^2\chi^2_n$.

↪ The empirical covariance matrix $\mathcal{S}$ has a $\frac{1}{n}W_p(\Sigma, n - 1)$ distribution.

↪ In the normal case, $\bar{x}$ and $\mathcal{S}$ are independent.

↪ For $\mathcal{M} \sim W_p(\Sigma, m)$, $\frac{a^\top\mathcal{M}a}{a^\top\Sigma a} \sim \chi^2_m$.


5.3 Hotelling's $T^2$-Distribution

Suppose that $Y \in \mathbb{R}^p$ is a standard normal random vector, i.e. $Y \sim N_p(0, \mathcal{I})$, independent of the random matrix $\mathcal{M} \sim W_p(\mathcal{I}, n)$. What is the distribution of $Y^\top\mathcal{M}^{-1}Y$? The answer is provided by the Hotelling $T^2$-distribution: $nY^\top\mathcal{M}^{-1}Y$ is Hotelling $T^2_{p,n}$ distributed.

The Hotelling $T^2$-distribution is a generalisation of the Student $t$-distribution. The general multinormal distribution $N(\mu, \Sigma)$ is considered in Theorem 5.8. The Hotelling $T^2$-distribution will play a central role in hypothesis testing in Chap. 7.

Theorem 5.8 If $X \sim N_p(\mu, \Sigma)$ is independent of $\mathcal{M} \sim W_p(\Sigma, n)$, then

$$n(X - \mu)^\top\mathcal{M}^{-1}(X - \mu) \sim T^2_{p,n}.$$

Corollary 5.3 If $\bar{x}$ is the mean of a sample drawn from a normal population $N_p(\mu, \Sigma)$ and $\mathcal{S}$ is the sample covariance matrix, then

$$(n - 1)(\bar{x} - \mu)^\top\mathcal{S}^{-1}(\bar{x} - \mu) = n(\bar{x} - \mu)^\top\mathcal{S}_u^{-1}(\bar{x} - \mu) \sim T^2_{p,n-1}. \tag{5.17}$$

Recall that $\mathcal{S}_u = \frac{n}{n-1}\mathcal{S}$ is an unbiased estimator of the covariance matrix. A connection between the Hotelling $T^2$- and the $F$-distribution is given by the next theorem.

Theorem 5.9

$$T^2_{p,n} = \frac{np}{n - p + 1}F_{p,n-p+1}.$$

Example 5.5 In the univariate case ($p = 1$), this theorem boils down to the well-known result:

$$\left(\frac{\bar{x} - \mu}{\sqrt{\mathcal{S}_u}/\sqrt{n}}\right)^2 \sim T^2_{1,n-1} = F_{1,n-1} = t^2_{n-1}.$$
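For $p = 1$ the statistic in (5.17) and the squared $t$-statistic are the same number, which is easy to confirm on data. The following Python sketch uses a small made-up sample and a hypothetical value $\mu_0 = 4$; the function names are illustrative. Note that $\mathcal{S}$ uses the divisor $n$ and $\mathcal{S}_u$ the divisor $n - 1$, as in the text.

```python
import math

def hotelling_t2_univariate(x, mu0):
    """(n - 1)(xbar - mu0)^2 / S, the p = 1 case of (5.17), with S using divisor n."""
    n = len(x)
    xbar = sum(x) / n
    s = sum((xi - xbar) ** 2 for xi in x) / n
    return (n - 1) * (xbar - mu0) ** 2 / s

def t_statistic(x, mu0):
    """Classical t-statistic based on the unbiased variance S_u (divisor n - 1)."""
    n = len(x)
    xbar = sum(x) / n
    s_u = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    return (xbar - mu0) / math.sqrt(s_u / n)

sample = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7]   # made-up data for illustration
t2 = hotelling_t2_univariate(sample, 4.0)
t = t_statistic(sample, 4.0)
print(t2, t ** 2)   # the two values agree up to floating-point error
```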


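For further details on the Hotelling $T^2$-distribution see Mardia et al. (1979). The next corollary follows immediately from (3.23), (3.24) and from Theorem 5.8. It will be useful for testing linear restrictions in multinormal populations.

Corollary 5.4 Consider a linear transform of $X \sim N_p(\mu, \Sigma)$, $Y = \mathcal{A}X$ where $\mathcal{A}(q \times p)$ with $(q \le p)$. If $\bar{x}$ and $\mathcal{S}_X$ are the sample mean and the covariance matrix, we have

$$\bar{y} = \mathcal{A}\bar{x} \sim N_q\left(\mathcal{A}\mu, \frac{1}{n}\mathcal{A}\Sigma\mathcal{A}^\top\right)$$

$$n\mathcal{S}_Y = n\mathcal{A}\mathcal{S}_X\mathcal{A}^\top \sim W_q(\mathcal{A}\Sigma\mathcal{A}^\top, n - 1)$$

$$(n - 1)(\mathcal{A}\bar{x} - \mathcal{A}\mu)^\top(\mathcal{A}\mathcal{S}_X\mathcal{A}^\top)^{-1}(\mathcal{A}\bar{x} - \mathcal{A}\mu) \sim T^2_{q,n-1}.$$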


The $T^2$-distribution is closely connected to the univariate $t$-statistic. In Example 5.4 we described the manner in which the Wishart distribution generalises the $\chi^2$-distribution. We can write (5.17) as:

$$T^2 = \sqrt{n}(\bar{x} - \mu)^\top\left(\frac{\sum_{j=1}^n (x_j - \bar{x})(x_j - \bar{x})^\top}{n - 1}\right)^{-1}\sqrt{n}(\bar{x} - \mu)$$

which is of the form

$$\begin{pmatrix} \text{multivariate normal} \\ \text{random vector} \end{pmatrix}^\top \left(\frac{\text{Wishart random matrix}}{\text{degrees of freedom}}\right)^{-1} \begin{pmatrix} \text{multivariate normal} \\ \text{random vector} \end{pmatrix}.$$

This is analogous to

$$t^2 = \sqrt{n}(\bar{x} - \mu)(s^2)^{-1}\sqrt{n}(\bar{x} - \mu)$$

or

$$\begin{pmatrix} \text{normal} \\ \text{random variable} \end{pmatrix} \left(\frac{\chi^2\text{-random variable}}{\text{degrees of freedom}}\right)^{-1} \begin{pmatrix} \text{normal} \\ \text{random variable} \end{pmatrix}$$

for the univariate case. Since the multivariate normal and Wishart random variables are independently distributed, their joint distribution is the product of the marginal normal and Wishart distributions. Using calculus, the distribution of $T^2$ as given above can be derived from this joint distribution.


Summary

↪ Hotelling's $T^2$-distribution is a generalisation of the $t$-distribution. In particular $T^2_{1,n} = t^2_n$.

↪ $(n - 1)(\bar{x} - \mu)^\top\mathcal{S}^{-1}(\bar{x} - \mu)$ has a $T^2_{p,n-1}$ distribution.

↪ The relation between Hotelling's $T^2$- and Fisher's $F$-distribution is given by $T^2_{p,n} = \frac{np}{n - p + 1}F_{p,n-p+1}$.


5.4  Spherical  and  Elliptical  Distributions 

The  multinormal  distribution  belongs  to  the  large  family  of  elliptical  distributions 
which  has  recently  gained  a  lot  of  attention  in  financial  mathematics.  Elliptical 
distributions  are  often  used,  particularly  in  risk  management. 

Definition 5.1 A $(p \times 1)$ random vector $Y$ is said to have a spherical distribution $S_p(\phi)$ if its characteristic function $\psi_Y(t)$ satisfies: $\psi_Y(t) = \phi(t^\top t)$ for some scalar function $\phi(\cdot)$ which is then called the characteristic generator of the spherical distribution $S_p(\phi)$. We will write $Y \sim S_p(\phi)$.

This is only one of several possible ways to define spherical distributions. We can see spherical distributions as an extension of the standard multinormal distribution $N_p(0, \mathcal{I}_p)$.

Theorem 5.10 Spherical random variables have the following properties:

1. All marginal distributions of a spherically distributed random vector are spherical.

2. All the marginal characteristic functions have the same generator.

3. Let $X \sim S_p(\phi)$, then $X$ has the same distribution as $ru^{(p)}$ where $u^{(p)}$ is a random vector distributed uniformly on the unit sphere surface in $\mathbb{R}^p$ and $r \ge 0$ is a random variable independent of $u^{(p)}$. If $E(r^2) < \infty$, then

$$E(X) = 0, \qquad \operatorname{Cov}(X) = \frac{E(r^2)}{p}\mathcal{I}_p.$$

The random radius $r$ is related to the generator $\phi$ by a relation described in Fang, Kotz, and Ng (1990, p. 29). The moments of $X \sim S_p(\phi)$, provided that they exist, can be expressed in terms of one-dimensional integrals.

A spherically distributed random vector does not, in general, necessarily possess a density. However, if it does, the marginal densities of dimension smaller than $p - 1$ are continuous and the marginal densities of dimension smaller than $p - 2$ are differentiable (except possibly at the origin in both cases). Univariate marginal densities for $p$ greater than 2 are non-decreasing on $(-\infty, 0)$ and non-increasing on $(0, \infty)$.

Definition 5.2 A $(p \times 1)$ random vector $X$ is said to have an elliptical distribution with parameters $\mu(p \times 1)$ and $\Sigma(p \times p)$ if $X$ has the same distribution as $\mu + \mathcal{A}^\top Y$, where $Y \sim S_k(\phi)$ and $\mathcal{A}$ is a $(k \times p)$ matrix such that $\mathcal{A}^\top\mathcal{A} = \Sigma$ with $\operatorname{rank}(\Sigma) = k$. We shall write $X \sim EC_p(\mu, \Sigma, \phi)$.

Remark 5.1 The elliptical distribution can be seen as an extension of $N_p(\mu, \Sigma)$.

Example 5.6 The multivariate $t$-distribution. Let $Z \sim N_p(0, \mathcal{I}_p)$ and $s \sim \chi^2_m$ be independent. The random vector

$$Y = \sqrt{m}\,\frac{Z}{\sqrt{s}}$$

has a multivariate $t$-distribution with $m$ degrees of freedom. Moreover the $t$-distribution belongs to the family of $p$-dimensional spherical distributions.


Example 5.7 The multinormal distribution. Let $X \sim N_p(\mu, \Sigma)$. Then $X \sim EC_p(\mu, \Sigma, \phi)$ and $\phi(u) = \exp(-u/2)$. Figure 4.3 shows a density surface of the multivariate normal distribution: $f(x) = \det(2\pi\Sigma)^{-1/2}\exp\left\{-\frac{1}{2}(x - \mu)^\top\Sigma^{-1}(x - \mu)\right\}$ with $\Sigma = \begin{pmatrix} 1 & 0.6 \\ 0.6 & 1 \end{pmatrix}$ and a fixed mean vector $\mu$. Notice that the density is constant on ellipses. This is the reason for calling this family of distributions “elliptical”.


Theorem 5.11 Elliptical random vectors $X$ have the following properties:

1. Any linear combination of elliptically distributed variables is elliptical.

2. Marginal distributions of elliptically distributed variables are elliptical.

3. A scalar function $\phi(\cdot)$ can determine an elliptical distribution $EC_p(\mu, \Sigma, \phi)$ for every $\mu \in \mathbb{R}^p$ and $\Sigma \ge 0$ with $\operatorname{rank}(\Sigma) = k$ iff $\phi(t^\top t)$ is a $p$-dimensional characteristic function.

4. Assume that $X$ is non-degenerate. If $X \sim EC_p(\mu, \Sigma, \phi)$ and $X \sim EC_p(\mu^*, \Sigma^*, \phi^*)$, then a constant $c > 0$ exists such that

$$\mu = \mu^*, \qquad \Sigma = c\Sigma^*, \qquad \phi^*(\cdot) = \phi(c\,\cdot).$$

In other words $\Sigma$, $\phi$, $\mathcal{A}$ are not unique, unless we impose the condition that $\det(\Sigma) = 1$.

5. The characteristic function of $X$, $\psi(t) = E(e^{it^\top X})$, is of the form

$$\psi(t) = e^{it^\top\mu}\phi(t^\top\Sigma t)$$

for a scalar function $\phi$.


6. $X \sim EC_p(\mu, \Sigma, \phi)$ with $\operatorname{rank}(\Sigma) = k$ iff $X$ has the same distribution as:

$$\mu + r\mathcal{A}^\top u^{(k)} \tag{5.18}$$

where $r \ge 0$ is independent of $u^{(k)}$ which is a random vector distributed uniformly on the unit sphere surface in $\mathbb{R}^k$ and $\mathcal{A}$ is a $(k \times p)$ matrix such that $\mathcal{A}^\top\mathcal{A} = \Sigma$.

7. Assume that $X \sim EC_p(\mu, \Sigma, \phi)$ and $E(r^2) < \infty$. Then

$$E(X) = \mu, \qquad \operatorname{Cov}(X) = \frac{E(r^2)}{\operatorname{rank}(\Sigma)}\Sigma = -2\phi'(0)\Sigma.$$

8. Assume that $X \sim EC_p(\mu, \Sigma, \phi)$ with $\operatorname{rank}(\Sigma) = k$. Then

$$Q(X) = (X - \mu)^\top\Sigma^{-}(X - \mu)$$

has the same distribution as $r^2$ in Eq. (5.18).


5.5  Exercises 


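Exercise 5.1 Consider $X \sim N_2(\mu, \Sigma)$ with $\mu = (2, 2)^\top$ and $\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ and the matrices $\mathcal{A} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}^\top$, $\mathcal{B} = \begin{pmatrix} 1 \\ -1 \end{pmatrix}^\top$. Show that $\mathcal{A}X$ and $\mathcal{B}X$ are independent.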


Exercise 5.2 Prove Theorem 5.4.

Exercise 5.3 Prove proposition (c) of Theorem 5.7.

Exercise 5.4 Let

$$X \sim \ldots$$

and

$$Y \mid X \sim \ldots$$

(a) Determine the distribution of $Y_2 \mid Y_1$.

(b) Determine the distribution of $W = X - Y$.

Exercise 5.5 Consider $(X, Y, Z)^\top \sim N_3(\mu, \Sigma)$. Compute $\mu$ and $\Sigma$ knowing that

$$Y \mid Z \sim N_1(-Z, 1)$$

$$\mu_{Z \mid Y} = -\frac{1}{3} - \frac{1}{3}Y$$

$$X \mid Y, Z \sim N_1(2 + 2Y + 3Z, 1).$$

Determine the distributions of $X \mid Y$ and of $X \mid Y + Z$.
Exercise 5.6 Knowing that

$$Z \sim N_1(0, 1)$$

$$Y \mid Z \sim N_1(1 + Z, 1)$$

$$X \mid Y, Z \sim N_1(1 - Y, 1)$$

(a) find the distribution of $(X, Y, Z)^\top$ and of $Y \mid X, Z$.

(b) find the distribution of $\begin{pmatrix} U \\ V \end{pmatrix}$.

(c) compute $E(Y \mid U = 2)$.


Exercise 5.7 Suppose $\begin{pmatrix} X \\ Y \end{pmatrix} \sim N_2(\mu, \Sigma)$ with $\Sigma$ positive definite. Is it possible that

(a) $\mu_{X \mid Y} = 3Y^2$,

(b) $\sigma_{XX \mid Y} = 2 + Y^2$,

(c) $\mu_{X \mid Y} = 3 - Y$, and

(d) $\sigma_{XX \mid Y} = 5$?

Exercise 5.8 Let $X \sim N_3(\ldots)$.

(a) Find the best linear approximation of $X_3$ by a linear function of $X_1$ and $X_2$ and compute the multiple correlation between $X_3$ and $(X_1, X_2)$.

(b) Let $Z_1 = X_2 - X_3$, $Z_2 = X_2 + X_3$ and $(Z_3 \mid Z_1, Z_2) \sim N_1(Z_1 + Z_2, 10)$. Compute the distribution of $\begin{pmatrix} Z_1 \\ Z_2 \\ Z_3 \end{pmatrix}$.

Exercise 5.9 Let $(X, Y, Z)^\top$ be a trivariate normal r.v. with

$$Y \mid Z \sim N_1(2Z, 24)$$

$$Z \mid X \sim N_1(2X + 3, 14)$$

$$X \sim N_1(1, 4)$$

and $\rho_{XY} = 0.5$.

Find the distribution of $(X, Y, Z)^\top$ and compute the partial correlation between $X$ and $Y$ for fixed $Z$. Do you think it is reasonable to approximate $X$ by a linear function of $Y$ and $Z$?


Exercise 5.10 Let $X \sim N_4\left((\ldots, 4)^\top, \begin{pmatrix} 4 & 1 & 2 & 4 \\ 1 & 4 & 2 & 1 \\ 2 & 2 & 16 & 1 \\ 4 & 1 & 1 & 9 \end{pmatrix}\right)$.

(a) Give the best linear approximation of $X_2$ as a function of $(X_1, X_4)$ and evaluate the quality of the approximation.

(b) Give the best linear approximation of $X_2$ as a function of $(X_1, X_3, X_4)$ and compare your answer with part (a).


Exercise  5.11  Prove  Theorem  5.2. 

(Hint:  complete  the  linear  transformation  Z  —  ^  X  +  c  ^  and  then  use 

Theorem  5.1  to  get  the  marginal  of  the  first  q  components  of  Z.) 


Exercise  5.12  Prove  Corollaries  5.1  and  5.2. 


Chapter  6 

Theory  of  Estimation 


We know from our basic knowledge of statistics that one of the objectives in statistics is to better understand and model the underlying process which generates data. This is known as statistical inference: we infer from information contained in a sample properties of the population from which the observations are taken. In multivariate statistical inference, we do exactly the same. The basic ideas were introduced in Sect. 4.5 on sampling theory: we observed the values of a multivariate random variable $X$ and obtained a sample $\mathcal{X} = \{x_i\}_{i=1}^n$. Under random sampling, these observations are considered to be realisations of a sequence of i.i.d. random variables $X_1, \ldots, X_n$ where each $X_i$ is a $p$-variate random variable which replicates the parent or population random variable $X$. In this chapter, for notational convenience, we will no longer differentiate between a random variable $X_i$ and an observation of it, $x_i$, in our notation. We will simply write $x_i$ and it should be clear from the context whether a random variable or an observed value is meant.

Statistical inference infers from the i.i.d. random sample $\mathcal{X}$ the properties of the population: typically, some unknown characteristic $\theta$ of its distribution. In parametric statistics, $\theta$ is a $k$-variate vector $\theta \in \mathbb{R}^k$ characterising the unknown properties of the population pdf $f(x; \theta)$: this could be the mean, the covariance matrix, kurtosis, etc.

The aim will be to estimate $\theta$ from the sample $\mathcal{X}$ through estimators $\hat{\theta}$ which are functions of the sample: $\hat{\theta} = \hat{\theta}(\mathcal{X})$. When an estimator $\hat{\theta}$ is proposed, we must derive its sampling distribution to analyse its properties.

In  this  chapter  the  basic  theoretical  tools  are  developed  which  are  needed  to 
derive  estimators  and  to  determine  their  properties  in  general  situations.  We  will 
basically  rely  on  the  maximum  likelihood  theory  in  our  presentation.  In  many 
situations,  the  maximum  likelihood  estimators  (MLEs)  indeed  share  asymptotic 
optimal  properties  which  make  their  use  easy  and  appealing. 

We will illustrate the multivariate normal population and also the linear regression model where the applications are numerous and the derivations are easy to do. In multivariate setups, the MLE is at times too complicated to be derived analytically. In such cases, the estimators are obtained using numerical methods (nonlinear optimisation). The general theory and the asymptotic properties of these estimators remain simple and valid. The following Chap. 7 concentrates on hypothesis testing and confidence interval issues.


6.1  The  Likelihood  Function 

Suppose that $\{x_i\}_{i=1}^n$ is an i.i.d. sample from a population with pdf $f(x; \theta)$. The aim is to estimate $\theta \in \mathbb{R}^k$ which is a vector of unknown parameters. The likelihood function is defined as the joint density $L(\mathcal{X}; \theta)$ of the observations considered as a function of $\theta$:

$$L(\mathcal{X}; \theta) = \prod_{i=1}^n f(x_i; \theta), \tag{6.1}$$

where $\mathcal{X}$ denotes the sample of the data matrix with the observations $x_1^\top, \ldots, x_n^\top$ in each row. The MLE of $\theta$ is defined as

$$\hat{\theta} = \arg\max_{\theta} L(\mathcal{X}; \theta).$$

Often it is easier to maximise the log-likelihood function

$$\ell(\mathcal{X}; \theta) = \log L(\mathcal{X}; \theta), \tag{6.2}$$

which is equivalent since the logarithm is a monotone one-to-one function. Hence

$$\hat{\theta} = \arg\max_{\theta} L(\mathcal{X}; \theta) = \arg\max_{\theta} \ell(\mathcal{X}; \theta).$$

The following examples illustrate cases where the maximisation process can be performed analytically, i.e., we will obtain an explicit analytical expression for $\hat{\theta}$. Unfortunately, in other situations, the maximisation process can be more intricate, involving nonlinear optimisation techniques. In the latter case, given a sample $\mathcal{X}$ and the likelihood function, numerical methods will be used to determine the value of $\theta$ maximising $L(\mathcal{X}; \theta)$ or $\ell(\mathcal{X}; \theta)$. These numerical methods are typically based on Newton-Raphson iterative techniques.
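To make the idea of a Newton-Raphson search concrete, here is a minimal one-parameter sketch in Python. The model is an assumption chosen for illustration only (an exponential distribution with rate $\lambda$, which is not discussed in the text), because its MLE $n / \sum_i x_i$ is known in closed form and can be used to check the iteration. The function name `newton_raphson_mle` and the data are hypothetical.

```python
def newton_raphson_mle(score, hessian, theta0, tol=1e-10, max_iter=100):
    """Solve score(theta) = 0 by Newton-Raphson, starting from theta0."""
    theta = theta0
    for _ in range(max_iter):
        step = score(theta) / hessian(theta)
        theta -= step                     # theta_new = theta - score / hessian
        if abs(step) < tol:
            break
    return theta

# Illustrative model: x_i i.i.d. exponential with rate lam, so that
#   log-likelihood  l(lam) = n log(lam) - lam * sum(x)
#   score           l'(lam) = n / lam - sum(x)
#   second deriv.   l''(lam) = -n / lam^2
data = [0.8, 1.9, 0.4, 2.6, 1.3]          # made-up observations
n, total = len(data), sum(data)

lam_hat = newton_raphson_mle(
    score=lambda lam: n / lam - total,
    hessian=lambda lam: -n / lam ** 2,
    theta0=0.5,
)
print(lam_hat, n / total)                 # numerical and closed-form MLE
```

As with any Newton-type method, convergence depends on the starting value; for this particular likelihood the iteration converges from any start strictly between 0 and $2n / \sum_i x_i$.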

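Example 6.1 Consider a sample $\{x_i\}_{i=1}^n$ from $N_p(\mu, \mathcal{I})$, i.e., from the pdf

$$f(x; \theta) = (2\pi)^{-p/2}\exp\left\{-\frac{1}{2}(x - \theta)^\top(x - \theta)\right\}$$

where $\theta = \mu \in \mathbb{R}^p$ is the mean vector parameter. The log-likelihood is in this case

$$\ell(\mathcal{X}; \theta) = \sum_{i=1}^n \log\{f(x_i; \theta)\} = \log(2\pi)^{-np/2} - \frac{1}{2}\sum_{i=1}^n (x_i - \theta)^\top(x_i - \theta). \tag{6.3}$$

The term $(x_i - \theta)^\top(x_i - \theta)$ equals

$$(x_i - \bar{x})^\top(x_i - \bar{x}) + (\bar{x} - \theta)^\top(\bar{x} - \theta) + 2(\bar{x} - \theta)^\top(x_i - \bar{x}).$$

Summing this term over $i = 1, \ldots, n$ we see that

$$\sum_{i=1}^n (x_i - \theta)^\top(x_i - \theta) = \sum_{i=1}^n (x_i - \bar{x})^\top(x_i - \bar{x}) + n(\bar{x} - \theta)^\top(\bar{x} - \theta).$$

Hence

$$\ell(\mathcal{X}; \theta) = \log(2\pi)^{-np/2} - \frac{1}{2}\sum_{i=1}^n (x_i - \bar{x})^\top(x_i - \bar{x}) - \frac{n}{2}(\bar{x} - \theta)^\top(\bar{x} - \theta).$$

Only the last term depends on $\theta$ and is obviously maximised for

$$\hat{\theta} = \hat{\mu} = \bar{x}.$$

Thus $\bar{x}$ is the MLE of $\theta$ for this family of pdfs $f(x, \theta)$.

A more complex example is the following one where we derive the MLEs for $\mu$ and $\Sigma$.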

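Example 6.2 Suppose $\{x_i\}_{i=1}^n$ is a sample from a normal distribution $N_p(\mu, \Sigma)$. Here $\theta = (\mu, \Sigma)$ with $\Sigma$ interpreted as a vector. Due to the symmetry of $\Sigma$ the unknown parameter $\theta$ is in fact $\{p + \frac{1}{2}p(p + 1)\}$-dimensional. Then

$$L(\mathcal{X}; \theta) = |2\pi\Sigma|^{-n/2}\exp\left\{-\frac{1}{2}\sum_{i=1}^n (x_i - \mu)^\top\Sigma^{-1}(x_i - \mu)\right\} \tag{6.4}$$

and

$$\ell(\mathcal{X}; \theta) = -\frac{n}{2}\log|2\pi\Sigma| - \frac{1}{2}\sum_{i=1}^n (x_i - \mu)^\top\Sigma^{-1}(x_i - \mu). \tag{6.5}$$

The term $(x_i - \mu)^\top\Sigma^{-1}(x_i - \mu)$ equals

$$(x_i - \bar{x})^\top\Sigma^{-1}(x_i - \bar{x}) + (\bar{x} - \mu)^\top\Sigma^{-1}(\bar{x} - \mu) + 2(\bar{x} - \mu)^\top\Sigma^{-1}(x_i - \bar{x}).$$

Summing this term over $i = 1, \ldots, n$ we see that

$$\sum_{i=1}^n (x_i - \mu)^\top\Sigma^{-1}(x_i - \mu) = \sum_{i=1}^n (x_i - \bar{x})^\top\Sigma^{-1}(x_i - \bar{x}) + n(\bar{x} - \mu)^\top\Sigma^{-1}(\bar{x} - \mu).$$

Note that from (2.14)

$$(x_i - \bar{x})^\top\Sigma^{-1}(x_i - \bar{x}) = \operatorname{tr}\left\{(x_i - \bar{x})^\top\Sigma^{-1}(x_i - \bar{x})\right\} = \operatorname{tr}\left\{\Sigma^{-1}(x_i - \bar{x})(x_i - \bar{x})^\top\right\}.$$

Therefore, by summing over the index $i$ we finally arrive at

$$\sum_{i=1}^n (x_i - \mu)^\top\Sigma^{-1}(x_i - \mu) = \operatorname{tr}\left\{\Sigma^{-1}\sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^\top\right\} + n(\bar{x} - \mu)^\top\Sigma^{-1}(\bar{x} - \mu)$$

$$= \operatorname{tr}\{\Sigma^{-1}n\mathcal{S}\} + n(\bar{x} - \mu)^\top\Sigma^{-1}(\bar{x} - \mu).$$

Thus the log-likelihood function for $N_p(\mu, \Sigma)$ is

$$\ell(\mathcal{X}; \theta) = -\frac{n}{2}\log|2\pi\Sigma| - \frac{n}{2}\operatorname{tr}\{\Sigma^{-1}\mathcal{S}\} - \frac{n}{2}(\bar{x} - \mu)^\top\Sigma^{-1}(\bar{x} - \mu). \tag{6.6}$$

We can easily see that the third term is maximised by $\mu = \bar{x}$. In fact the MLEs are given by

$$\hat{\mu} = \bar{x}, \qquad \hat{\Sigma} = \mathcal{S}.$$

The derivation of $\hat{\Sigma}$ is a lot more complicated. It involves derivatives with respect to matrices with their notational complexities and will not be presented here; for a more elaborate proof, see Mardia, Kent and Bibby (1979, pp. 103-104). Note that the unbiased covariance estimator $\mathcal{S}_u = \frac{n}{n-1}\mathcal{S}$ is not the MLE of $\Sigma$!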

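Example 6.3 Consider the linear regression model $y_i = \beta^\top x_i + \varepsilon_i$ for $i = 1, \ldots, n$, where $\varepsilon_i$ is i.i.d. and $N(0, \sigma^2)$ and where $x_i \in \mathbb{R}^p$. Here $\theta = (\beta^\top, \sigma)$ is a $(p + 1)$-dimensional parameter vector. Denote

$$y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad \mathcal{X} = \begin{pmatrix} x_1^\top \\ \vdots \\ x_n^\top \end{pmatrix}.$$

Then

$$L(y, \mathcal{X}; \theta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma}\exp\left\{-\frac{1}{2\sigma^2}(y_i - \beta^\top x_i)^2\right\}$$

and

$$\ell(y, \mathcal{X}; \theta) = \log\left\{\frac{1}{(2\pi)^{n/2}\sigma^n}\right\} - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \beta^\top x_i)^2$$

$$= -\frac{n}{2}\log(2\pi) - n\log\sigma - \frac{1}{2\sigma^2}(y - \mathcal{X}\beta)^\top(y - \mathcal{X}\beta)$$

$$= -\frac{n}{2}\log(2\pi) - n\log\sigma - \frac{1}{2\sigma^2}(y^\top y + \beta^\top\mathcal{X}^\top\mathcal{X}\beta - 2\beta^\top\mathcal{X}^\top y).$$

Differentiating w.r.t. the parameters yields

$$\frac{\partial}{\partial\beta}\ell = -\frac{1}{2\sigma^2}(2\mathcal{X}^\top\mathcal{X}\beta - 2\mathcal{X}^\top y)$$

$$\frac{\partial}{\partial\sigma}\ell = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\left\{(y - \mathcal{X}\beta)^\top(y - \mathcal{X}\beta)\right\}.$$

Note that $\frac{\partial}{\partial\beta}\ell$ denotes the vector of the derivatives w.r.t. all components of $\beta$ (the gradient). Since the first equation only depends on $\beta$, we start with deriving $\hat{\beta}$.

$$\mathcal{X}^\top\mathcal{X}\hat{\beta} = \mathcal{X}^\top y, \quad \text{hence} \quad \hat{\beta} = (\mathcal{X}^\top\mathcal{X})^{-1}\mathcal{X}^\top y.$$

Plugging $\hat{\beta}$ into the second equation gives

$$\frac{n}{\hat{\sigma}} = \frac{1}{\hat{\sigma}^3}(y - \mathcal{X}\hat{\beta})^\top(y - \mathcal{X}\hat{\beta}), \quad \text{hence} \quad \hat{\sigma}^2 = \frac{1}{n}\|y - \mathcal{X}\hat{\beta}\|^2,$$

where $\|\cdot\|^2$ denotes the Euclidean vector norm from Sect. 2.6. We see that the MLE $\hat{\beta}$ is identical with the least squares estimator (3.52). The variance estimator

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{\beta}^\top x_i)^2$$

is nothing else than the residual sum of squares (RSS) from (3.37), divided by $n$, generalised to the case of multivariate $x_i$. Note that when the $x_i$ are considered to be fixed, we have

$$E(y) = \mathcal{X}\beta \quad \text{and} \quad \operatorname{Var}(y) = \sigma^2\mathcal{I}_n.$$

Then, using the properties of moments from Sect. 4.2 we have

$$E(\hat{\beta}) = (\mathcal{X}^\top\mathcal{X})^{-1}\mathcal{X}^\top E(y) = \beta, \tag{6.7}$$

$$\operatorname{Var}(\hat{\beta}) = \sigma^2(\mathcal{X}^\top\mathcal{X})^{-1}. \tag{6.8}$$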
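The two estimators of Example 6.3 can be computed directly from the normal equations. The Python sketch below is restricted, for simplicity, to a design matrix with two columns (an intercept and one regressor) so that the $2 \times 2$ system can be solved by hand; the function name `ols_mle` and the data are made up for illustration.

```python
def ols_mle(X, y):
    """MLE of beta (least squares) and of sigma^2 for a design matrix with two columns."""
    n = len(y)
    # Entries of X^T X and X^T y
    a = sum(r[0] * r[0] for r in X)
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X)
    u = sum(r[0] * yi for r, yi in zip(X, y))
    v = sum(r[1] * yi for r, yi in zip(X, y))
    det = a * d - b * b
    # beta_hat = (X^T X)^{-1} X^T y via the explicit 2 x 2 inverse
    beta = [(d * u - b * v) / det, (a * v - b * u) / det]
    resid = [yi - (beta[0] * r[0] + beta[1] * r[1]) for r, yi in zip(X, y)]
    sigma2 = sum(e * e for e in resid) / n     # divisor n: this is the MLE, not the unbiased estimator
    return beta, sigma2

# Made-up data: intercept column plus one regressor
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.1, 2.9, 5.2, 6.8]

beta_hat, sigma2_hat = ols_mle(X, y)
print(beta_hat, sigma2_hat)
```

Because the divisor is $n$ rather than $n - p$, `sigma2_hat` is the maximum likelihood estimate and is biased downwards in finite samples.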

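Summary

↪ If $\{x_i\}_{i=1}^n$ is an i.i.d. sample from a distribution with pdf $f(x; \theta)$, then $L(\mathcal{X}; \theta) = \prod_{i=1}^n f(x_i; \theta)$ is the likelihood function. The MLE is that value of $\theta$ which maximises $L(\mathcal{X}; \theta)$. Equivalently one can maximise the log-likelihood $\ell(\mathcal{X}; \theta)$.

↪ The MLEs of $\mu$ and $\Sigma$ from a $N_p(\mu, \Sigma)$ distribution are $\hat{\mu} = \bar{x}$ and $\hat{\Sigma} = \mathcal{S}$. Note that the MLE of $\Sigma$ is not unbiased.

↪ The MLEs of $\beta$ and $\sigma$ in the linear model $y = \mathcal{X}\beta + \varepsilon$, $\varepsilon \sim N_n(0, \sigma^2\mathcal{I})$ are given by the least squares estimator $\hat{\beta} = (\mathcal{X}^\top\mathcal{X})^{-1}\mathcal{X}^\top y$ and $\hat{\sigma}^2 = \frac{1}{n}\|y - \mathcal{X}\hat{\beta}\|^2$. $E(\hat{\beta}) = \beta$ and $\operatorname{Var}(\hat{\beta}) = \sigma^2(\mathcal{X}^\top\mathcal{X})^{-1}$.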


6.2  The  Cramer-Rao  Lower  Bound 

As pointed out above, an important question in estimation theory is whether an estimator $\hat{\theta}$ has certain desired properties, in particular, if it converges to the unknown parameter $\theta$ it is supposed to estimate. One typical property we want for an estimator is unbiasedness, meaning that on the average, the estimator hits its target: $E(\hat{\theta}) = \theta$. We have seen for instance (see Example 6.2) that $\bar{x}$ is an unbiased estimator of $\mu$ and $\mathcal{S}$ is a biased estimator of $\Sigma$ in finite samples. If we restrict ourselves to unbiased estimation, then the natural question is whether the estimator shares some optimality properties in terms of its sampling variance. Since we focus on unbiasedness, we look for an estimator with the smallest possible variance.

In  this  context,  the  Cramer-Rao  lower  bound  will  give  the  minimal  achievable 
variance  for  any  unbiased  estimator.  This  result  is  valid  under  very  general  regularity 
conditions  (discussed  below).  One  of  the  most  important  applications  of  the 
Cramer-Rao  lower  bound  is  that  it  provides  the  asymptotic  optimality  property 
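of MLEs. The Cramer-Rao theorem involves the score function and its properties which will be derived first.

The score function $s(\mathcal{X};\theta)$ is the derivative of the log likelihood function w.r.t. $\theta \in \mathbb{R}^k$:

$$s(\mathcal{X};\theta) = \frac{\partial}{\partial\theta}\ell(\mathcal{X};\theta) = \frac{1}{L(\mathcal{X};\theta)}\frac{\partial}{\partial\theta}L(\mathcal{X};\theta). \qquad (6.9)$$

The covariance matrix $\mathcal{F}_n = \mathrm{Var}\{s(\mathcal{X};\theta)\}$ is called the Fisher information matrix.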

In  what  follows,  we  will  give  some  interesting  properties  of  score  functions. 

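Theorem 6.1 If $s = s(\mathcal{X};\theta)$ is the score function and if $\hat\theta = t = t(\mathcal{X},\theta)$ is any function of $\mathcal{X}$ and $\theta$, then under regularity conditions

$$E(st^\top) = \frac{\partial}{\partial\theta}E(t^\top) - E\left(\frac{\partial t^\top}{\partial\theta}\right). \qquad (6.10)$$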

The proof is left as an exercise (see Exercise 6.9). The regularity conditions required for this theorem are rather technical and ensure that the expressions (expectations and derivatives) appearing in (6.10) are well defined. In particular, the support of the density $f(x;\theta)$ should not depend on $\theta$. The next corollary is a direct consequence.

Corollary 6.1 If $s = s(\mathcal{X};\theta)$ is the score function, and $\hat\theta = t = t(\mathcal{X})$ is any unbiased estimator of $\theta$ (i.e., $E(t) = \theta$), then

$$E(st^\top) = \mathrm{Cov}(s,t) = \mathcal{I}_k. \qquad (6.11)$$

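Note that the score function has mean zero (see Exercise 6.10):

$$E\{s(\mathcal{X};\theta)\} = 0. \qquad (6.12)$$

Hence, $E(ss^\top) = \mathrm{Var}(s) = \mathcal{F}_n$ and by setting $s = t$ in Theorem 6.1 it follows that

$$\mathcal{F}_n = -E\left\{\frac{\partial^2}{\partial\theta\,\partial\theta^\top}\ell(\mathcal{X};\theta)\right\}.$$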
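Written out, setting $t = s$ in (6.10) and using (6.12) expresses the Fisher information as the negative expected Hessian of the log-likelihood. The lines below are a sketch that takes the regularity conditions of Theorem 6.1 for granted:

```latex
\begin{align*}
\mathcal{F}_n = E(s s^\top)
  &= \frac{\partial}{\partial\theta} E(s^\top)
     - E\left(\frac{\partial s^\top}{\partial\theta}\right)
     && \text{(Theorem 6.1 with } t = s\text{)} \\
  &= 0 - E\left\{\frac{\partial^2}{\partial\theta\,\partial\theta^\top}\,
     \ell(\mathcal{X};\theta)\right\}
     && \text{(by (6.12), } E(s) = 0 \text{ for every } \theta\text{)} \\
  &= -E\left\{\frac{\partial^2}{\partial\theta\,\partial\theta^\top}\,
     \ell(\mathcal{X};\theta)\right\}.
\end{align*}
```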

Remark 6.1 If $x_1,\ldots,x_n$ are i.i.d., $\mathcal{F}_n = n\mathcal{F}_1$ where $\mathcal{F}_1$ is the Fisher information matrix for sample size $n = 1$.

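Example 6.4 Consider an i.i.d. sample $\{x_i\}_{i=1}^n$ from $N_p(\theta,\mathcal{I})$. In this case the parameter $\theta$ is the mean $\mu$. It follows from (6.3) that

$$s(\mathcal{X};\theta) = \frac{\partial}{\partial\theta}\ell(\mathcal{X};\theta) = -\frac{1}{2}\frac{\partial}{\partial\theta}\left\{\sum_{i=1}^n (x_i-\theta)^\top(x_i-\theta)\right\} = n(\bar x - \theta).$$

Hence, the information matrix is

$$\mathcal{F}_n = \mathrm{Var}\{n(\bar x - \theta)\} = n\mathcal{I}_p.$$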
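The two properties used in Example 6.4 (mean-zero score, variance equal to the Fisher information) can be checked by simulation. The sketch below assumes the univariate case $p = 1$; the parameter value, sample size, number of replications and seed are arbitrary choices made for illustration, not values from the text.

```python
import random
import statistics


def score_normal_mean(sample, theta):
    """Score of an i.i.d. N(theta, 1) sample: n * (sample mean - theta)."""
    n = len(sample)
    return n * (statistics.fmean(sample) - theta)


def simulate_scores(theta=2.0, n=25, replications=20000, seed=1):
    """Draw many samples and evaluate the score at the true theta each time."""
    rng = random.Random(seed)
    scores = []
    for _ in range(replications):
        sample = [rng.gauss(theta, 1.0) for _ in range(n)]
        scores.append(score_normal_mean(sample, theta))
    return scores


scores = simulate_scores()
score_mean = statistics.fmean(scores)      # should be close to 0, cf. (6.12)
score_var = statistics.pvariance(scores)   # should be close to F_n = n = 25

print(f"mean of score     : {score_mean:.3f}  (theory: 0)")
print(f"variance of score : {score_var:.3f}  (theory: n = 25)")
```

Both printed values fluctuate with the seed; only their closeness to 0 and to $n$ is the point of the check.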

How well can we estimate $\theta$? The answer is given in the following theorem which is from Cramer and Rao. As pointed out above, this theorem gives a lower bound for the variance of unbiased estimators. Hence, all estimators, which are unbiased and attain this lower bound, are minimum variance estimators.

Theorem 6.2 (Cramer-Rao) If $\hat\theta = t = t(\mathcal{X})$ is any unbiased estimator for $\theta$, then under regularity conditions

$$\mathrm{Var}(t) \ge \mathcal{F}_n^{-1}, \qquad (6.13)$$

where

$$\mathcal{F}_n = E\{s(\mathcal{X};\theta)s(\mathcal{X};\theta)^\top\} = \mathrm{Var}\{s(\mathcal{X};\theta)\} \qquad (6.14)$$

is the Fisher information matrix.

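Proof Consider the correlation $\rho_{Y,Z}$ between $Y$ and $Z$ where $Y = a^\top t$, $Z = c^\top s$. Here $s$ is the score and the vectors $a, c \in \mathbb{R}^k$. By Corollary 6.1 $\mathrm{Cov}(s,t) = \mathcal{I}_k$ and thus

$$\mathrm{Cov}(Y,Z) = a^\top\mathrm{Cov}(t,s)c = a^\top c,$$

$$\mathrm{Var}(Z) = c^\top\mathrm{Var}(s)c = c^\top\mathcal{F}_n c.$$

Hence,

$$\rho^2_{Y,Z} = \frac{\mathrm{Cov}^2(Y,Z)}{\mathrm{Var}(Y)\mathrm{Var}(Z)} = \frac{(a^\top c)^2}{a^\top\mathrm{Var}(t)a \cdot c^\top\mathcal{F}_n c} \le 1. \qquad (6.15)$$

In particular, this holds for any $c \ne 0$. Therefore it holds also for the maximum of the left-hand side of (6.15) with respect to $c$. Since

$$\max_c \frac{c^\top aa^\top c}{c^\top\mathcal{F}_n c} = \max_{c^\top\mathcal{F}_n c = 1} c^\top aa^\top c$$

and

$$\max_{c^\top\mathcal{F}_n c = 1} c^\top aa^\top c = a^\top\mathcal{F}_n^{-1}a$$

by our maximisation Theorem 2.5, we have

$$\frac{a^\top\mathcal{F}_n^{-1}a}{a^\top\mathrm{Var}(t)a} \le 1 \quad \forall\, a \in \mathbb{R}^k,\ a \ne 0,$$

i.e.

$$a^\top\{\mathrm{Var}(t) - \mathcal{F}_n^{-1}\}a \ge 0 \quad \forall\, a \in \mathbb{R}^k,\ a \ne 0,$$

which is equivalent to $\mathrm{Var}(t) \ge \mathcal{F}_n^{-1}$. □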

MLEs  attain  the  lower  bound  if  the  sample  size  n  goes  to  infinity.  The  next 
Theorem  6.3  states  this  and,  in  addition,  gives  the  asymptotic  sampling  distribution 
of  the  maximum  likelihood  estimation,  which  turns  out  to  be  multinormal. 

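Theorem 6.3 Suppose that the sample $\{x_i\}_{i=1}^n$ is i.i.d. If $\hat\theta$ is the MLE for $\theta \in \mathbb{R}^k$, i.e., $\hat\theta = \arg\max_\theta L(\mathcal{X};\theta)$, then under some regularity conditions, as $n \to \infty$:

$$\sqrt{n}(\hat\theta - \theta) \xrightarrow{\mathcal{L}} N_k(0, \mathcal{F}_1^{-1}) \qquad (6.16)$$

where $\mathcal{F}_1$ denotes the Fisher information for sample size $n = 1$.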

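As a consequence of Theorem 6.3 we see that under regularity conditions the MLE is asymptotically unbiased, efficient (minimum variance) and normally distributed. Also it is a consistent estimator of $\theta$.

Note that from property (5.4) of the multinormal it follows that asymptotically

$$n(\hat\theta - \theta)^\top\mathcal{F}_1(\hat\theta - \theta) \xrightarrow{\mathcal{L}} \chi^2_p. \qquad (6.17)$$

If $\hat{\mathcal{F}}_1$ is a consistent estimator of $\mathcal{F}_1$ (e.g. $\hat{\mathcal{F}}_1 = \mathcal{F}_1(\hat\theta)$), we have equivalently

$$n(\hat\theta - \theta)^\top\hat{\mathcal{F}}_1(\hat\theta - \theta) \xrightarrow{\mathcal{L}} \chi^2_p. \qquad (6.18)$$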

This expression is sometimes useful in testing hypotheses about $\theta$ and in constructing confidence regions for $\theta$ in a very general setup. These issues will be raised in more detail in the next chapter, but from (6.18) it can be seen, for instance, that when $n$ is large,

$$P\left\{n(\hat\theta - \theta)^\top\hat{\mathcal{F}}_1(\hat\theta - \theta) \le \chi^2_{1-\alpha;p}\right\} \approx 1 - \alpha,$$

where $\chi^2_{\nu;p}$ denotes the $\nu$-quantile of a $\chi^2_p$ random variable. So, the ellipsoid $n(\hat\theta - \theta)^\top\hat{\mathcal{F}}_1(\hat\theta - \theta) \le \chi^2_{1-\alpha;p}$ provides in $\mathbb{R}^p$ an asymptotic $(1-\alpha)$-confidence region for $\theta$.
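The asymptotic coverage statement can be illustrated with a one-parameter model. The sketch below assumes an exponential distribution with mean $\theta$ (so that the MLE is the sample mean and $\mathcal{F}_1(\theta) = 1/\theta^2$); this model, the sample size, the seed and the hard-coded quantile $\chi^2_{0.95;1} \approx 3.8415$ are illustrative assumptions, not part of the text.

```python
import random
import statistics

CHI2_95_1 = 3.8415  # 0.95-quantile of the chi-square distribution with 1 d.f.


def wald_statistic(sample, theta):
    """n * (theta_hat - theta)^2 * F_1(theta_hat) for an exponential mean."""
    n = len(sample)
    theta_hat = statistics.fmean(sample)   # MLE of the exponential mean
    fisher_hat = 1.0 / theta_hat ** 2      # plug-in estimate of F_1
    return n * (theta_hat - theta) ** 2 * fisher_hat


def coverage(theta=3.0, n=200, replications=2000, seed=7):
    """Share of samples whose asymptotic 95% region contains the true theta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replications):
        # expovariate takes the rate, i.e. 1 / mean
        sample = [rng.expovariate(1.0 / theta) for _ in range(n)]
        if wald_statistic(sample, theta) <= CHI2_95_1:
            hits += 1
    return hits / replications


cov = coverage()
print(f"empirical coverage: {cov:.3f}  (nominal: 0.95)")
```

For moderate $n$ the empirical coverage is only approximately 0.95, which is exactly the caveat attached to asymptotic regions.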
Summary

→ The score function is the derivative $s(\mathcal{X};\theta) = \frac{\partial}{\partial\theta}\ell(\mathcal{X};\theta)$ of the log-likelihood with respect to $\theta$. The covariance matrix of $s(\mathcal{X};\theta)$ is the Fisher information matrix.

→ The score function has mean zero: $E\{s(\mathcal{X};\theta)\} = 0$.

→ The Cramer-Rao bound says that any unbiased estimator $\hat\theta = t = t(\mathcal{X})$ has a variance that is bounded from below by the inverse of the Fisher information. Thus, an unbiased estimator, which attains this lower bound, is a minimum variance estimator.

→ For i.i.d. data $\{x_i\}_{i=1}^n$ the Fisher information matrix is $\mathcal{F}_n = n\mathcal{F}_1$.

→ MLEs attain the lower bound in an asymptotic sense, i.e., $\sqrt{n}(\hat\theta - \theta) \xrightarrow{\mathcal{L}} N_k(0, \mathcal{F}_1^{-1})$ if $\hat\theta$ is the MLE for $\theta \in \mathbb{R}^k$, i.e., $\hat\theta = \arg\max_\theta L(\mathcal{X};\theta)$.


6.3  Exercises 

Exercise 6.1 Consider a uniform distribution on the interval $[0, \theta]$. What is the MLE of $\theta$? (Hint: the maximisation here cannot be performed by means of derivatives. Here the support of $x$ depends on $\theta$.)

Exercise 6.2 Consider an i.i.d. sample of size $n$ from the bivariate population with pdf $f(x_1,x_2) = (\theta_1\theta_2)^{-1}\exp(-x_1/\theta_1 - x_2/\theta_2)$, $x_1, x_2 > 0$. Compute the MLE of $\theta = (\theta_1,\theta_2)$. Find the Cramer-Rao lower bound. Is it possible to derive a minimal variance unbiased estimator of $\theta$?

Exercise 6.3 Show that the MLE of Example 6.1, $\hat\mu = \bar x$, is a minimal variance estimator for any finite sample size $n$ (i.e., without applying Theorem 6.3).

Exercise 6.4 We know from Example 6.4 that the MLE of Example 6.1 has $\mathcal{F}_1 = \mathcal{I}_p$. This leads to

$$\sqrt{n}(\bar x - \mu) \xrightarrow{\mathcal{L}} N_p(0,\mathcal{I})$$

by Theorem 6.3. Can you give an analogous result for the square $\bar x^2$ for the case $p = 1$?

Exercise 6.5 Consider an i.i.d. sample of size $n$ from the bivariate population with pdf $f(x_1,x_2) = (\theta_1^2\theta_2 x_2)^{-1}\exp(-x_1/\theta_1 x_2 - x_2/\theta_1\theta_2)$, $x_1, x_2 > 0$. Compute the MLE of $\theta = (\theta_1,\theta_2)$. Find the Cramer-Rao lower bound and the asymptotic variance of $\hat\theta$.

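Exercise 6.6 Consider a sample $\{x_i\}_{i=1}^n$ from $N_p(\mu,\Sigma_0)$ where $\Sigma_0$ is known. Compute the Cramer-Rao lower bound for $\mu$. Can you derive a minimal unbiased estimator for $\mu$?

Exercise 6.7 Let $X \sim N_p(\mu,\Sigma)$ where $\Sigma$ is unknown but we know $\Sigma = \mathrm{diag}(\sigma_{11},\sigma_{22},\ldots,\sigma_{pp})$. From an i.i.d. sample of size $n$, find the MLE of $\mu$ and of $\Sigma$.

Exercise 6.8 Reconsider the setup of the previous exercise. Suppose that

$$\Sigma = \mathrm{diag}(\sigma_{11},\sigma_{22},\ldots,\sigma_{pp}).$$

Can you derive in this case the Cramer-Rao lower bound for $\theta^\top = (\mu_1 \ldots \mu_p, \sigma_{11} \ldots \sigma_{pp})$?

Exercise 6.9 Prove Theorem 6.1. Hint: start from $\frac{\partial}{\partial\theta}E(t^\top) = \frac{\partial}{\partial\theta}\int t^\top(\mathcal{X};\theta)L(\mathcal{X};\theta)\,d\mathcal{X}$, then permute integral and derivatives and note that $s(\mathcal{X};\theta) = \frac{1}{L(\mathcal{X};\theta)}\frac{\partial}{\partial\theta}L(\mathcal{X};\theta)$.

Exercise 6.10 Prove expression (6.12).

(Hint: start from $E\{s(\mathcal{X};\theta)\} = \int \frac{1}{L(\mathcal{X};\theta)}\frac{\partial}{\partial\theta}L(\mathcal{X};\theta)\,L(\mathcal{X};\theta)\,d\mathcal{X}$ and then permute integral and derivative.)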


Chapter  7 

Hypothesis  Testing 


In the preceding chapter, the theoretical basis of estimation theory was presented. Now we turn our interest towards testing issues: we want to test the hypothesis $H_0$ that the unknown parameter $\theta$ belongs to some subspace of $\mathbb{R}^q$. This subspace is called the null set and will be denoted by $\Omega_0 \subset \mathbb{R}^q$.

In many cases, this null set corresponds to restrictions which are imposed on the parameter space: $H_0$ corresponds to a “reduced model”. As we have already seen in Chap. 3, the solution to a testing problem is in terms of a rejection region $R$ which is a set of values in the sample space which leads to the decision of rejecting the null hypothesis $H_0$ in favour of an alternative $H_1$, which is called the “full model”.

In general, we want to construct a rejection region $R$ which controls the size of the type I error, i.e. the probability of rejecting the null hypothesis when it is true. More formally, a solution to a testing problem is of predetermined size $\alpha$ if:

$$P(\text{Rejecting } H_0 \mid H_0 \text{ is true}) = \alpha.$$

In fact, since $H_0$ is often a composite hypothesis, it is achieved by finding $R$ such that

$$\sup_{\theta\in\Omega_0} P(\mathcal{X} \in R \mid \theta) = \alpha.$$

In  this  chapter  we  will  introduce  a  tool  which  allows  us  to  build  a  rejection 
region  in  general  situations;  it  is  based  on  the  likelihood  ratio  principle.  This  is 
a  very  useful  technique  because  it  allows  us  to  derive  a  rejection  region  with  an 
asymptotically  appropriate  size  a.  The  technique  will  be  illustrated  through  various 
testing  problems  and  examples.  We  concentrate  on  multinormal  populations  and 
linear  models  where  the  size  of  the  test  will  often  be  exact  even  for  finite  sample 
sizes  n . 

Section  7.1  gives  the  basic  ideas  and  Sect.  7.2  presents  the  general  problem  of 
testing  linear  restrictions.  This  allows  us  to  propose  solutions  to  frequent  types 


©  Springer- Verlag  Berlin  Heidelberg  2015 

W.K.  Hardle,  L.  Simar,  Applied  Multivariate  Statistical  Analysis, 

DOI  10.1007/978-3-662-45171-7  7 




of  analyses  (including  comparisons  of  several  means,  repeated  measurements  and 
profile  analysis).  Each  case  can  be  viewed  as  a  simple  specific  case  of  testing  linear 
restrictions.  Special  attention  is  devoted  to  confidence  intervals  and  confidence 
regions  for  means  and  for  linear  restrictions  on  means  in  a  multinormal  setup. 


7.1  Likelihood  Ratio  Test 

Suppose that the distribution of $\{x_i\}_{i=1}^n$, $x_i \in \mathbb{R}^p$, depends on a parameter vector $\theta$. We will consider two hypotheses:

$$H_0 : \theta \in \Omega_0$$
$$H_1 : \theta \in \Omega_1.$$

The hypothesis $H_0$ corresponds to the “reduced model” and $H_1$ to the “full model”. This notation was already used in Chap. 3.

Example 7.1 Consider a multinormal $N_p(\theta,\mathcal{I})$. To test if $\theta$ equals a certain fixed value $\theta_0$ we construct the test problem:

$$H_0 : \theta = \theta_0$$
$$H_1 : \text{no constraints on } \theta$$

or, equivalently, $\Omega_0 = \{\theta_0\}$, $\Omega_1 = \mathbb{R}^p$.

Define $L_j^* = \max_{\theta\in\Omega_j} L(\mathcal{X};\theta)$, the maxima of the likelihood for each of the hypotheses. Consider the likelihood ratio (LR)

$$\lambda(\mathcal{X}) = \frac{L_0^*}{L_1^*}. \qquad (7.1)$$

One tends to favour $H_0$ if the LR is high and $H_1$ if the LR is low. The likelihood ratio test (LRT) tells us when exactly to favour $H_0$ over $H_1$. A LRT of size $\alpha$ for testing $H_0$ against $H_1$ has the rejection region

$$R = \{\mathcal{X} : \lambda(\mathcal{X}) < c\},$$

where $c$ is determined so that $\sup_{\theta\in\Omega_0} P_\theta(\mathcal{X} \in R) = \alpha$. The difficulty here is to express $c$ as a function of $\alpha$, because $\lambda(\mathcal{X})$ might be a complicated function of $\mathcal{X}$.

Instead of $\lambda$ we may equivalently use the log-likelihood ratio

$$-2\log\lambda = 2(\ell_1^* - \ell_0^*).$$

In this case the rejection region will be $R = \{\mathcal{X} : -2\log\lambda(\mathcal{X}) > k\}$. What is the distribution of $\lambda$ or of $-2\log\lambda$ from which we need to compute $c$ or $k$?

Theorem 7.1 If $\Omega_1 \subset \mathbb{R}^q$ is a $q$-dimensional space and if $\Omega_0 \subset \Omega_1$ is an $r$-dimensional subspace, then under regularity conditions

$$\forall\,\theta \in \Omega_0 : \quad -2\log\lambda \xrightarrow{\mathcal{L}} \chi^2_{q-r} \quad \text{as } n \to \infty.$$

An asymptotic rejection region can now be given by simply computing the $1-\alpha$ quantile $k = \chi^2_{1-\alpha;q-r}$. The LRT rejection region is therefore

$$R = \{\mathcal{X} : -2\log\lambda(\mathcal{X}) > \chi^2_{1-\alpha;q-r}\}.$$

Theorem 7.1 is thus very helpful: it gives a general way of building rejection regions in many problems. Unfortunately, it is only an asymptotic result, meaning that the size of the test is only approximately equal to $\alpha$, although the approximation becomes better when the sample size $n$ increases. The question is “how large should $n$ be?”. There is no definite rule: we encounter here the same problem that was already discussed with respect to the Central Limit Theorem in Chap. 4.

Fortunately, in many standard circumstances, we can derive exact tests even for finite samples because the test statistic $-2\log\lambda(\mathcal{X})$ or a simple transformation of it turns out to have a simple form. This is the case in most of the following standard testing problems. All of them can be viewed as an illustration of the likelihood ratio principle.

Test Problem 1 is an amuse-bouche: in testing the mean of a multinormal population with a known covariance matrix the likelihood ratio statistic has a very simple quadratic form with a known distribution under $H_0$.


Test Problem 1. Suppose that $X_1,\ldots,X_n$ is an i.i.d. random sample from a $N_p(\mu,\Sigma)$ population.

$H_0 : \mu = \mu_0$, $\Sigma$ known versus $H_1$ : no constraints.


In this case $H_0$ is a simple hypothesis, i.e. $\Omega_0 = \{\mu_0\}$ and therefore the dimension $r$ of $\Omega_0$ equals 0. Since we have imposed no constraints in $H_1$, the space $\Omega_1$ is the whole $\mathbb{R}^p$ which leads to $q = p$. From (6.6) we know that

$$\ell_0^* = \ell(\mu_0,\Sigma) = -\frac{n}{2}\log|2\pi\Sigma| - \frac{1}{2}n\,\mathrm{tr}(\Sigma^{-1}\mathcal{S}) - \frac{1}{2}n(\bar x - \mu_0)^\top\Sigma^{-1}(\bar x - \mu_0).$$

Under $H_1$ the maximum of $\ell(\mu,\Sigma)$ is

$$\ell_1^* = \ell(\bar x,\Sigma) = -\frac{n}{2}\log|2\pi\Sigma| - \frac{1}{2}n\,\mathrm{tr}(\Sigma^{-1}\mathcal{S}).$$

Therefore,

$$-2\log\lambda = 2(\ell_1^* - \ell_0^*) = n(\bar x - \mu_0)^\top\Sigma^{-1}(\bar x - \mu_0) \qquad (7.2)$$

which, by Theorem 4.7, has a $\chi^2_p$-distribution under $H_0$.
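As a minimal sketch of how (7.2) is evaluated, the code below computes the statistic for the bivariate case $p = 2$ and compares it with the quantile $\chi^2_{0.95;2} \approx 5.991$. The input numbers and the helper names (`inverse_2x2`, `lrt_known_sigma`) are hypothetical and chosen only for illustration.

```python
CHI2_95_2 = 5.991  # 0.95-quantile of the chi-square distribution with 2 d.f.


def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def lrt_known_sigma(xbar, mu0, sigma, n):
    """-2 log lambda = n (xbar - mu0)^T Sigma^{-1} (xbar - mu0) for p = 2."""
    inv = inverse_2x2(sigma)
    d = [xbar[0] - mu0[0], xbar[1] - mu0[1]]
    quad = sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))
    return n * quad


# Hypothetical sample mean, tested mean, known covariance and sample size.
stat = lrt_known_sigma(xbar=[1.2, 0.7], mu0=[1.0, 1.0],
                       sigma=[[1.0, 0.5], [0.5, 2.0]], n=50)
reject = stat > CHI2_95_2

print(f"-2 log lambda = {stat:.4f}, reject H0 at the 5% level: {reject}")
```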

Example 7.2 Consider the bank data again. Let us test whether the population mean of the forged bank notes is equal to

$$\mu_0 = (214.9, 129.9, 129.7, 8.3, 10.1, 141.5)^\top.$$

(This is in fact the sample mean of the genuine bank notes.) The sample mean of the forged bank notes is

$$\bar x = (214.8, 130.3, 130.2, 10.5, 11.1, 139.4)^\top.$$

Suppose for the moment that the estimated covariance matrix $\mathcal{S}_f$ given in (3.5) is the true covariance matrix $\Sigma$. We construct the LRT and obtain

$$-2\log\lambda = 2(\ell_1^* - \ell_0^*) = n(\bar x - \mu_0)^\top\Sigma^{-1}(\bar x - \mu_0) = 7362.32,$$

the quantile $k = \chi^2_{0.95;6}$ equals 12.592. The rejection region consists of all values in the sample space which lead to values of the LRT statistic larger than 12.592. Under $H_0$ the value of $-2\log\lambda$ is therefore highly significant. Hence, the true mean of the forged bank notes is significantly different from $\mu_0$!

Test Problem 2 is the same as the preceding one but in a more realistic situation where the covariance matrix is unknown; here the Hotelling's $T^2$-distribution will be useful to determine an exact test and a confidence region for the unknown $\mu$.


Test Problem 2. Suppose that $X_1,\ldots,X_n$ is an i.i.d. random sample from a $N_p(\mu,\Sigma)$ population.

$H_0 : \mu = \mu_0$, $\Sigma$ unknown versus $H_1$ : no constraints.

Under $H_0$ it can be shown that the MLE of $\Sigma$ is

$$\mathcal{S}_0 = \frac{1}{n}\sum_{i=1}^n (x_i - \mu_0)(x_i - \mu_0)^\top = \mathcal{S} + (\bar x - \mu_0)(\bar x - \mu_0)^\top,$$

so that

$$\ell_0^* = \ell(\mu_0, \mathcal{S} + dd^\top), \quad d = (\bar x - \mu_0) \qquad (7.3)$$

and under $H_1$ we have

$$\ell_1^* = \ell(\bar x, \mathcal{S}).$$

This leads after some calculation to

$$\begin{aligned}
-2\log\lambda &= 2(\ell_1^* - \ell_0^*) \\
&= -n\log|\mathcal{S}| - n\,\mathrm{tr}(\mathcal{S}^{-1}\mathcal{S}) - n(\bar x - \bar x)^\top\mathcal{S}^{-1}(\bar x - \bar x) + n\log|\mathcal{S} + dd^\top| \\
&\quad + n\,\mathrm{tr}\{(\mathcal{S} + dd^\top)^{-1}\mathcal{S}\} + n(\bar x - \mu_0)^\top(\mathcal{S} + dd^\top)^{-1}(\bar x - \mu_0) \\
&= n\log\frac{|\mathcal{S} + dd^\top|}{|\mathcal{S}|} + n\,\mathrm{tr}\{(\mathcal{S} + dd^\top)^{-1}\mathcal{S}\} + n\,d^\top(\mathcal{S} + dd^\top)^{-1}d - np \\
&= n\log\frac{|\mathcal{S} + dd^\top|}{|\mathcal{S}|} + n\,\mathrm{tr}\{(\mathcal{S} + dd^\top)^{-1}(dd^\top + \mathcal{S})\} - np \\
&= n\log\frac{|\mathcal{S} + dd^\top|}{|\mathcal{S}|} \\
&= n\log|\mathcal{I}_p + \mathcal{S}^{-1/2}dd^\top\mathcal{S}^{-1/2}|.
\end{aligned}$$

By using the result for the determinant of a partitioned matrix, this equals

$$n\log\begin{vmatrix} 1 & -d^\top\mathcal{S}^{-1/2} \\ \mathcal{S}^{-1/2}d & \mathcal{I}_p \end{vmatrix}.$$

Evaluating the same determinant with respect to the lower right block $\mathcal{I}_p$ (equivalently, expanding along the first row) gives $1 + d^\top\mathcal{S}^{-1/2}\mathcal{S}^{-1/2}d$, so that

$$-2\log\lambda = n\log(1 + d^\top\mathcal{S}^{-1}d). \qquad (7.4)$$

This statistic is a monotone function of $(n-1)d^\top\mathcal{S}^{-1}d$. This means that $-2\log\lambda > k$ if and only if $(n-1)d^\top\mathcal{S}^{-1}d > k'$. The latter statistic has by Corollary 5.3, under $H_0$, a Hotelling's $T^2$-distribution. Therefore,

$$(n-1)(\bar x - \mu_0)^\top\mathcal{S}^{-1}(\bar x - \mu_0) \sim T^2_{p,n-1}, \qquad (7.5)$$

or equivalently

$$\left(\frac{n-p}{p}\right)(\bar x - \mu_0)^\top\mathcal{S}^{-1}(\bar x - \mu_0) \sim F_{p,n-p}. \qquad (7.6)$$

In this case an exact rejection region may be defined as

$$\left(\frac{n-p}{p}\right)(\bar x - \mu_0)^\top\mathcal{S}^{-1}(\bar x - \mu_0) > F_{1-\alpha;p,n-p}.$$

Alternatively, we have from Theorem 7.1 that under $H_0$ the asymptotic distribution of the test statistic is

$$-2\log\lambda \xrightarrow{\mathcal{L}} \chi^2_p, \quad \text{as } n \to \infty$$

which leads to the (asymptotically valid) rejection region

$$n\log\{1 + (\bar x - \mu_0)^\top\mathcal{S}^{-1}(\bar x - \mu_0)\} > \chi^2_{1-\alpha;p},$$

but of course, in this case, we would prefer to use the exact $F$-test provided just above.

Example 7.3 Consider the problem of Example 7.2 again. We know that $\mathcal{S}_f$ is the empirical analogue for $\Sigma_f$, the covariance matrix for the forged banknotes. The test statistic (7.5) has the value 1153.4 or its equivalent for the $F$ distribution in (7.6) is 182.5 which is highly significant ($F_{0.95;6,94} = 2.1966$) so that we conclude that $\mu_f \ne \mu_0$.
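The link between the three forms of the statistic, (7.4), (7.5) and (7.6), is a pure rescaling, which the following sketch makes explicit. It uses the value 1153.4 quoted in Example 7.3 with $p = 6$ and $n = 100$ (the sample size implied by the degrees of freedom in $F_{0.95;6,94}$); the function names are illustrative.

```python
import math


def t2_to_f(t2, n, p):
    """Rescale (7.5) to (7.6): F = (n - p) / (p * (n - 1)) * T^2."""
    return (n - p) / (p * (n - 1)) * t2


def t2_to_minus2loglambda(t2, n):
    """(7.4) in terms of T^2 = (n - 1) d^T S^{-1} d."""
    return n * math.log(1.0 + t2 / (n - 1))


n, p = 100, 6          # n inferred from the quoted F(6, 94) distribution
t2 = 1153.4            # value of (7.5) reported in Example 7.3

f_stat = t2_to_f(t2, n, p)
lr_stat = t2_to_minus2loglambda(t2, n)

print(f"F statistic   : {f_stat:.1f}")
print(f"-2 log lambda : {lr_stat:.1f}")
```

The first printed value reproduces the 182.5 of Example 7.3 up to rounding; the second is the corresponding asymptotic statistic, to be compared with a $\chi^2_6$ quantile.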


Confidence Region for $\mu$

When estimating a multidimensional parameter $\theta \in \mathbb{R}^k$ from a sample, we saw in Chap. 6 how to determine the estimator $\hat\theta = \hat\theta(\mathcal{X})$. For the observed data we end up with a point estimate, which is the corresponding observed value of $\hat\theta$. We know $\hat\theta(\mathcal{X})$ is a random variable and we often prefer to determine a confidence region for $\theta$. A confidence region (CR) is a random subset of $\mathbb{R}^k$ (determined by appropriate statistics) such that we are “confident”, at a certain given level $1-\alpha$, that this region contains $\theta$:

$$P(\theta \in \mathrm{CR}) = 1 - \alpha.$$

This is just a multidimensional generalisation of the basic univariate confidence interval. Confidence regions are particularly useful when a hypothesis $H_0$ on $\theta$ is rejected, because they eventually help in identifying which component of $\theta$ is responsible for the rejection.

There are only a few cases where confidence regions can be easily assessed; these include most of the testing problems on means presented in this section.

Corollary 5.3 provides a pivotal quantity which allows confidence regions for $\mu$ to be constructed. Since $\left(\frac{n-p}{p}\right)(\bar x - \mu)^\top\mathcal{S}^{-1}(\bar x - \mu) \sim F_{p,n-p}$, we have

$$P\left\{\left(\frac{n-p}{p}\right)(\mu - \bar x)^\top\mathcal{S}^{-1}(\mu - \bar x) < F_{1-\alpha;p,n-p}\right\} = 1 - \alpha.$$

Then,

$$\mathrm{CR} = \left\{\mu \in \mathbb{R}^p \;\middle|\; (\mu - \bar x)^\top\mathcal{S}^{-1}(\mu - \bar x) \le \frac{p}{n-p}F_{1-\alpha;p,n-p}\right\}$$

is a confidence region at level $(1-\alpha)$ for $\mu$. It is the interior of an iso-distance ellipsoid in $\mathbb{R}^p$ centred at $\bar x$, with a scaling matrix $\mathcal{S}^{-1}$ and a distance constant $\frac{p}{n-p}F_{1-\alpha;p,n-p}$. When $p$ is large, ellipsoids are not easy to handle for practical purposes. One is thus interested in finding confidence intervals for $\mu_1, \mu_2, \ldots, \mu_p$ so that simultaneous confidence on all the intervals reaches the desired level of say, $1-\alpha$.

Below, we consider a more general problem. We construct simultaneous confidence intervals for all possible linear combinations $a^\top\mu$, $a \in \mathbb{R}^p$ of the elements of $\mu$.

Suppose for a moment that we fix a particular projection vector $a$. We are back to a standard univariate problem of finding a confidence interval for the mean $a^\top\mu$ of a univariate random variable $a^\top X$. We can use the $t$-statistics and an obvious confidence interval for $a^\top\mu$ is given by the values $a^\top\mu$ such that

$$\left|\frac{\sqrt{n-1}\,(a^\top\mu - a^\top\bar x)}{\sqrt{a^\top\mathcal{S}a}}\right| \le t_{1-\frac{\alpha}{2};n-1}$$

or equivalently

$$t^2(a) = \frac{(n-1)\{a^\top(\mu - \bar x)\}^2}{a^\top\mathcal{S}a} \le F_{1-\alpha;1,n-1}.$$

This provides the $(1-\alpha)$ confidence interval for $a^\top\mu$:

$$a^\top\bar x - \sqrt{F_{1-\alpha;1,n-1}\frac{a^\top\mathcal{S}a}{n-1}} \le a^\top\mu \le a^\top\bar x + \sqrt{F_{1-\alpha;1,n-1}\frac{a^\top\mathcal{S}a}{n-1}}.$$

Now it is easy to prove (using Theorem 2.5) that:

$$\max_a t^2(a) = (n-1)(\bar x - \mu)^\top\mathcal{S}^{-1}(\bar x - \mu) \sim T^2_{p,n-1}.$$

Therefore, simultaneously for all $a \in \mathbb{R}^p$, the interval

$$\left(a^\top\bar x - \sqrt{K_\alpha a^\top\mathcal{S}a},\; a^\top\bar x + \sqrt{K_\alpha a^\top\mathcal{S}a}\right), \qquad (7.7)$$

where $K_\alpha = \frac{p}{n-p}F_{1-\alpha;p,n-p}$, will contain $a^\top\mu$ with probability $(1-\alpha)$.

A particular choice of $a$ are the columns of the identity matrix $\mathcal{I}_p$, providing simultaneous confidence intervals for $\mu_1,\ldots,\mu_p$. We therefore have with probability $(1-\alpha)$ for $j = 1,\ldots,p$

$$\bar x_j - \sqrt{\frac{p}{n-p}F_{1-\alpha;p,n-p}\,s_{jj}} \le \mu_j \le \bar x_j + \sqrt{\frac{p}{n-p}F_{1-\alpha;p,n-p}\,s_{jj}}. \qquad (7.8)$$

It should be noted that these intervals define a rectangle circumscribing the confidence ellipsoid for $\mu$ given above. They are particularly useful when a null hypothesis $H_0$ of the type described above is rejected and one would like to see which component(s) are mainly responsible for the rejection.
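Interval (7.8) only needs the component mean, the component variance and the $F$ quantile. A minimal sketch, assuming the quantile is supplied by the user (here $F_{0.95;6,94} = 2.1966$ as quoted in this chapter); the values of `xbar_j` and `s_jj` are made up for illustration and the function name is hypothetical.

```python
import math


def simultaneous_interval(xbar_j, s_jj, n, p, f_quantile):
    """Interval (7.8) for one component mu_j.

    f_quantile is F_{1-alpha; p, n-p}; s_jj is the j-th diagonal element of S.
    """
    k_alpha = p / (n - p) * f_quantile          # K_alpha from (7.7)
    half_width = math.sqrt(k_alpha * s_jj)
    return xbar_j - half_width, xbar_j + half_width


# n, p and the F quantile match the bank note setting; xbar_j, s_jj are made up.
lower, upper = simultaneous_interval(xbar_j=10.0, s_jj=0.25,
                                     n=100, p=6, f_quantile=2.1966)
half_width = (upper - lower) / 2

print(f"{lower:.3f} <= mu_j <= {upper:.3f}")
```

Because $K_\alpha$ is shared by all components, the intervals differ only through $\bar x_j$ and $s_{jj}$; the covariances never enter.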

Example 7.4 The 95 % confidence region for $\mu_f$, the mean of the forged banknotes, is given by the ellipsoid:

$$\left\{\mu \in \mathbb{R}^6 \;\middle|\; (\mu - \bar x_f)^\top\mathcal{S}_f^{-1}(\mu - \bar x_f) \le \frac{6}{94}F_{0.95;6,94}\right\}.$$

The 95 % simultaneous confidence intervals are given by (we use $F_{0.95;6,94} = 2.1966$)

$$\begin{aligned}
214.692 &\le \mu_1 \le 214.954 \\
130.205 &\le \mu_2 \le 130.395 \\
130.082 &\le \mu_3 \le 130.304 \\
10.108 &\le \mu_4 \le 10.952 \\
10.896 &\le \mu_5 \le 11.370 \\
139.242 &\le \mu_6 \le 139.658.
\end{aligned}$$

Comparing the inequalities with $\mu_0 = (214.9, 129.9, 129.7, 8.3, 10.1, 141.5)^\top$ shows that almost all components (except the first one) are responsible for the rejection of $\mu_0$ in Examples 7.2 and 7.3.

In addition, the method can provide other confidence intervals. We have at the same level of confidence (choosing $a^\top = (0, 0, 0, 1, -1, 0)$)

$$-1.211 \le \mu_4 - \mu_5 \le 0.005$$

showing that for the forged bills, the lower border is essentially smaller than the upper border.

Remark 7.1 It should be noted that the confidence region is an ellipsoid whose characteristics depend on the whole matrix $\mathcal{S}$. In particular, the slope of the axis depends on the eigenvectors of $\mathcal{S}$ and therefore on the covariances $s_{ij}$. However, the rectangle circumscribing the confidence ellipsoid provides the simultaneous confidence intervals for $\mu_j$, $j = 1,\ldots,p$. They do not depend on the covariances $s_{ij}$, but only on the variances $s_{jj}$ [see (7.8)]. Since the ellipsoid lies inside this rectangle, it may happen that a tested value $\mu_0$ is covered by the intervals (7.8) but not covered by the confidence ellipsoid. In this case, $\mu_0$ is rejected by a test based on the confidence ellipsoid but not rejected by a test based on the simultaneous confidence intervals. The simultaneous confidence intervals are easier to handle than the full ellipsoid but we have lost some information, namely the covariance between the components (see Exercise 7.14).

The following problem concerns the covariance matrix in a multinormal population: in this situation the test statistic has a slightly more complicated distribution. We will therefore invoke the approximation of Theorem 7.1 in order to derive a test of approximate size $\alpha$.


Test Problem 3. Suppose that $X_1,\ldots,X_n$ is an i.i.d. random sample from a $N_p(\mu,\Sigma)$ population.

$H_0 : \Sigma = \Sigma_0$, $\mu$ unknown versus $H_1$ : no constraints.

Under $H_0$ we have $\hat\mu = \bar x$, and $\Sigma = \Sigma_0$, whereas under $H_1$ we have $\hat\mu = \bar x$, and $\hat\Sigma = \mathcal{S}$. Hence

$$\ell_0^* = \ell(\bar x,\Sigma_0) = -\frac{1}{2}n\log|2\pi\Sigma_0| - \frac{1}{2}n\,\mathrm{tr}(\Sigma_0^{-1}\mathcal{S})$$

$$\ell_1^* = \ell(\bar x,\mathcal{S}) = -\frac{1}{2}n\log|2\pi\mathcal{S}| - \frac{1}{2}np$$

and thus

$$-2\log\lambda = 2(\ell_1^* - \ell_0^*) = n\,\mathrm{tr}(\Sigma_0^{-1}\mathcal{S}) - n\log|\Sigma_0^{-1}\mathcal{S}| - np.$$

Note that this statistic is a function of the eigenvalues of $\Sigma_0^{-1}\mathcal{S}$. Unfortunately, the exact finite sample distribution of $-2\log\lambda$ is very complicated. Asymptotically, we have under $H_0$

$$-2\log\lambda \xrightarrow{\mathcal{L}} \chi^2_m \quad \text{as } n \to \infty$$

with $m = \frac{1}{2}\{p(p+1)\}$, since a $(p \times p)$ covariance matrix has only these $m$ parameters as a consequence of its symmetry.
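The statistic of Test Problem 3 can be sketched for $p = 2$ with elementary matrix operations. The matrices and sample size below are hypothetical, chosen so that the result is easy to check by hand: for $\mathcal{S} = \Sigma_0$ the statistic is zero, and for $\mathcal{S} = 2\Sigma_0$ it equals $n(4 - \log 4 - 2)$. The quantile $\chi^2_{0.95;3} \approx 7.815$ is hard-coded, since $m = 3$ when $p = 2$.

```python
import math

CHI2_95_3 = 7.815  # 0.95-quantile of the chi-square distribution with 3 d.f.


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inv2(m):
    """Inverse of a 2x2 matrix."""
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]


def matmul2(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def lrt_covariance(s, sigma0, n):
    """-2 log lambda = n tr(Sigma0^{-1} S) - n log|Sigma0^{-1} S| - n p, p = 2."""
    p = 2
    m = matmul2(inv2(sigma0), s)
    trace = m[0][0] + m[1][1]
    return n * trace - n * math.log(det2(m)) - n * p


sigma0 = [[2.0, 0.5], [0.5, 1.0]]                       # hypothetical Sigma_0
stat_equal = lrt_covariance(sigma0, sigma0, n=30)       # S = Sigma_0
stat_scaled = lrt_covariance([[4.0, 1.0], [1.0, 2.0]], sigma0, n=30)  # S = 2 Sigma_0

print(f"S = Sigma_0   : {stat_equal:.4f}")
print(f"S = 2 Sigma_0 : {stat_scaled:.4f}, reject: {stat_scaled > CHI2_95_3}")
```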


Example 7.5 Consider the US companies data set (Table 22.5) and suppose we are interested in the companies of the energy sector, analysing their assets ($X_1$) and sales ($X_2$). The sample is of size 15 and provides the value of

$$\mathcal{S} = 10^7 \times \begin{pmatrix} 1.6635 & 1.2410 \\ 1.2410 & 1.3747 \end{pmatrix}.$$

We want to test if

$$\mathrm{Var}\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = 10^7 \times \begin{pmatrix} 1.2248 & 1.1425 \\ 1.1425 & 1.5112 \end{pmatrix} = \Sigma_0.$$

($\Sigma_0$ is in fact the empirical variance matrix for $X_1$ and $X_2$ for the manufacturing sector). The test statistic (Q MVAusenergy) turns out to be $-2\log\lambda = 5.4046$ which is not significant for $\chi^2_3$ ($p$-value $= 0.1445$). So we cannot conclude that $\Sigma \ne \Sigma_0$.


In the next testing problem, we address a question that was already stated in Chap. 3, Sect. 3.6: testing a particular value of the coefficients $\beta$ in a linear model. The presentation is carried out in general terms so that it can be built on in the next section where we will test linear restrictions on $\beta$.


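Test Problem 4. Suppose that $Y_1,\ldots,Y_n$ are independent r.v.'s with $Y_i \sim N_1(\beta^\top x_i, \sigma^2)$, $x_i \in \mathbb{R}^p$.

$H_0 : \beta = \beta_0$, $\sigma^2$ unknown versus $H_1$ : no constraints.

Under $H_0$ we have $\beta = \beta_0$, $\hat\sigma_0^2 = \frac{1}{n}\|y - \mathcal{X}\beta_0\|^2$ and under $H_1$ we have $\hat\beta = (\mathcal{X}^\top\mathcal{X})^{-1}\mathcal{X}^\top y$, $\hat\sigma^2 = \frac{1}{n}\|y - \mathcal{X}\hat\beta\|^2$ (see Example 6.3). Hence by Theorem 7.1

$$-2\log\lambda = 2(\ell_1^* - \ell_0^*) = n\log\left(\frac{\|y - \mathcal{X}\beta_0\|^2}{\|y - \mathcal{X}\hat\beta\|^2}\right) \xrightarrow{\mathcal{L}} \chi^2_p.$$

We draw upon the result (3.45) which gives us

$$F = \frac{(n-p)}{p}\left(\frac{\|y - \mathcal{X}\beta_0\|^2}{\|y - \mathcal{X}\hat\beta\|^2} - 1\right) \sim F_{p,n-p},$$

so that in this case we again have an exact distribution.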

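Example 7.6 Let us consider our “classic blue” pullovers again. In Example 3.11 we tried to model the dependency of sales on prices. As we have seen in Fig. 3.5 the slope of the regression curve is rather small, hence we might ask if $(\alpha, \beta)^\top = (211, 0)^\top$. Here

$$y = \begin{pmatrix} y_1 \\ \vdots \\ y_{10} \end{pmatrix} = \begin{pmatrix} x_{1,1} \\ \vdots \\ x_{10,1} \end{pmatrix}, \quad \mathcal{X} = \begin{pmatrix} 1 & x_{1,2} \\ \vdots & \vdots \\ 1 & x_{10,2} \end{pmatrix}.$$

The test statistic for the LR test is

$$-2\log\lambda = 9.10$$

which under the $\chi^2_2$ distribution is significant. The exact $F$-test statistic

$$F = 5.93$$

is also significant under the $F_{2,8}$ distribution ($F_{2,8;0.95} = 4.46$).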
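Both statistics of Test Problem 4 depend on the data only through the ratio of the two residual sums of squares, so one can be converted into the other. The sketch below illustrates this with the rounded value $-2\log\lambda = 9.10$ from Example 7.6 ($n = 10$, $p = 2$); the function names are illustrative, and the small gap to the reported $F = 5.93$ comes from rounding of the input.

```python
import math


def f_from_rss(rss0, rss1, n, p):
    """F = (n - p) / p * (RSS0 / RSS1 - 1), with RSS0 under H0 and RSS1 under H1."""
    return (n - p) / p * (rss0 / rss1 - 1.0)


def minus2loglambda_from_rss(rss0, rss1, n):
    """-2 log lambda = n * log(RSS0 / RSS1)."""
    return n * math.log(rss0 / rss1)


def f_from_minus2loglambda(lr_stat, n, p):
    """Recover the RSS ratio from -2 log lambda, then apply the F formula."""
    ratio = math.exp(lr_stat / n)
    return (n - p) / p * (ratio - 1.0)


f_stat = f_from_minus2loglambda(9.10, n=10, p=2)
print(f"F recovered from -2 log lambda = 9.10: {f_stat:.2f}")
```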


Summary

→ The hypotheses $H_0 : \theta \in \Omega_0$ against $H_1 : \theta \in \Omega_1$ can be tested using the LRT. The likelihood ratio (LR) is the quotient $\lambda(\mathcal{X}) = L_0^*/L_1^*$ where the $L_j^*$ are the maxima of the likelihood for each of the hypotheses.

→ The test statistic in the LRT is $\lambda(\mathcal{X})$ or equivalently its logarithm $\log\lambda(\mathcal{X})$. If $\Omega_1$ is $q$-dimensional and $\Omega_0 \subset \Omega_1$ $r$-dimensional, then the asymptotic distribution of $-2\log\lambda$ is $\chi^2_{q-r}$. This allows $H_0$ to be tested against $H_1$ by calculating the test statistic $-2\log\lambda = 2(\ell_1^* - \ell_0^*)$ where $\ell_j^* = \log L_j^*$.

→ The hypothesis $H_0 : \mu = \mu_0$ for $X \sim N_p(\mu,\Sigma)$, where $\Sigma$ is known, leads to $-2\log\lambda = n(\bar x - \mu_0)^\top\Sigma^{-1}(\bar x - \mu_0) \sim \chi^2_p$.

→ The hypothesis $H_0 : \mu = \mu_0$ for $X \sim N_p(\mu,\Sigma)$, where $\Sigma$ is unknown, leads to $-2\log\lambda = n\log\{1 + (\bar x - \mu_0)^\top\mathcal{S}^{-1}(\bar x - \mu_0)\} \to \chi^2_p$, and

$$(n-1)(\bar x - \mu_0)^\top\mathcal{S}^{-1}(\bar x - \mu_0) \sim T^2_{p,n-1}.$$

→ The hypothesis $H_0 : \Sigma = \Sigma_0$ for $X \sim N_p(\mu,\Sigma)$, where $\mu$ is unknown, leads to $-2\log\lambda = n\,\mathrm{tr}(\Sigma_0^{-1}\mathcal{S}) - n\log|\Sigma_0^{-1}\mathcal{S}| - np \to \chi^2_m$, $m = \frac{1}{2}p(p+1)$.

→ The hypothesis $H_0 : \beta = \beta_0$ for $Y_i \sim N_1(\beta^\top x_i, \sigma^2)$, where $\sigma^2$ is unknown, leads to $-2\log\lambda = n\log\left(\frac{\|y - \mathcal{X}\beta_0\|^2}{\|y - \mathcal{X}\hat\beta\|^2}\right) \to \chi^2_p$.

7.2  Linear  Hypothesis 

In this section, we present a very general procedure which allows a linear hypothesis to be tested, i.e. a linear restriction, either on a vector mean $\mu$ or on the coefficient $\beta$ of a linear model. The presented technique covers many of the practical testing problems on means or regression coefficients.

Linear hypotheses are of the form $\mathcal{A}\mu = a$ with known matrices $\mathcal{A}(q \times p)$ and $a(q \times 1)$ with $q \le p$.

Example 7.7 Let $\mu = (\mu_1, \mu_2)^\top$. The hypothesis that $\mu_1 = \mu_2$ can be equivalently written as:

$$\mathcal{A}\mu = \begin{pmatrix} 1 & -1 \end{pmatrix}\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} = 0 = a.$$

The general idea is to test a normal population $H_0 : \mathcal{A}\mu = a$ (restricted model) against the full model $H_1$ where no restrictions are put on $\mu$. Due to the properties of the multinormal, we can easily adapt the Test Problems 1 and 2 to this new situation. Indeed we know, from Theorem 5.2, that $y_i = \mathcal{A}x_i \sim N_q(\mu_y, \Sigma_y)$, where $\mu_y = \mathcal{A}\mu$ and $\Sigma_y = \mathcal{A}\Sigma\mathcal{A}^\top$.




Testing the null $H_0: \mathcal{A}\mu = a$ is the same as testing $H_0: \mu_y = a$. The appropriate statistics are $\bar{y}$ and $\mathcal{S}_y$, which can be derived from the original statistics $\bar{x}$ and $\mathcal{S}$ available from $\mathcal{X}$:

$$\bar{y} = \mathcal{A}\bar{x}, \qquad \mathcal{S}_y = \mathcal{A}\mathcal{S}\mathcal{A}^\top.$$

Here the difference between the translated sample mean and the tested value is $d = \mathcal{A}\bar{x} - a$. We are now in the situation to proceed to Test Problems 5 and 6.


Test Problem 5. Suppose $X_1, \dots, X_n$ is an i.i.d. random sample from a $N_p(\mu, \Sigma)$ population.

$H_0: \mathcal{A}\mu = a$, $\Sigma$ known versus $H_1$: no constraints.


By (7.2) we have that, under $H_0$:

$$n(\mathcal{A}\bar{x} - a)^\top (\mathcal{A}\Sigma\mathcal{A}^\top)^{-1} (\mathcal{A}\bar{x} - a) \sim \chi^2_q,$$

and we reject $H_0$ if this test statistic is too large at the desired significance level.
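
The statistic of Test Problem 5 can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions: the choice of Python (rather than the book's R and MATLAB quantlets), the function name `lr_stat_known_sigma` and the toy numbers are all hypothetical and not taken from the text.

```python
import numpy as np

def lr_stat_known_sigma(xbar, Sigma, A, a, n):
    """-2 log(lambda) = n (A xbar - a)^T (A Sigma A^T)^{-1} (A xbar - a)."""
    A = np.atleast_2d(np.asarray(A, dtype=float))
    d = A @ np.asarray(xbar, dtype=float) - np.atleast_1d(np.asarray(a, dtype=float))
    M = A @ np.asarray(Sigma, dtype=float) @ A.T
    return float(n * d @ np.linalg.solve(M, d))

# Toy check of H0: mu_1 = mu_2, i.e. A = (1, -1), a = 0 (hypothetical numbers).
xbar = np.array([1.0, 0.5])
Sigma = np.eye(2)
stat = lr_stat_known_sigma(xbar, Sigma, A=[[1.0, -1.0]], a=[0.0], n=20)
print(stat)  # to be compared with the (1 - alpha) quantile of chi^2 with q = 1 d.f.
```

Using `np.linalg.solve` instead of forming the inverse explicitly is a numerical design choice of this sketch; it evaluates the same quadratic form.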

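Example 7.8 We consider hypotheses on partitioned mean vectors $\mu = \begin{pmatrix}\mu_1 \\ \mu_2\end{pmatrix}$. Let us first look at

$$H_0: \mu_1 = \mu_2, \quad \text{versus} \quad H_1: \text{no constraints},$$

for $N_{2p}\left(\begin{pmatrix}\mu_1 \\ \mu_2\end{pmatrix}, \begin{pmatrix}\Sigma & 0 \\ 0 & \Sigma\end{pmatrix}\right)$ with known $\Sigma$. This is equivalent to $\mathcal{A} = (\mathcal{I}, -\mathcal{I})$, $a = (0, \dots, 0)^\top \in \mathbb{R}^p$ and leads to

$$-2 \log \lambda = n(\bar{x}_1 - \bar{x}_2)^\top (2\Sigma)^{-1} (\bar{x}_1 - \bar{x}_2) \sim \chi^2_p.$$

Another example is the test whether $\mu_1 = 0$, i.e.

$$H_0: \mu_1 = 0, \quad \text{versus} \quad H_1: \text{no constraints},$$

for $N_{2p}\left(\begin{pmatrix}\mu_1 \\ \mu_2\end{pmatrix}, \begin{pmatrix}\Sigma & 0 \\ 0 & \Sigma\end{pmatrix}\right)$ with known $\Sigma$. This is equivalent to $\mathcal{A}\mu = a$ with $\mathcal{A} = (\mathcal{I}, 0)$, and $a = (0, \dots, 0)^\top \in \mathbb{R}^p$. Hence

$$-2 \log \lambda = n\bar{x}_1^\top \Sigma^{-1} \bar{x}_1 \sim \chi^2_p.$$


Test Problem 6. Suppose $X_1, \dots, X_n$ is an i.i.d. random sample from a $N_p(\mu, \Sigma)$ population.

$H_0: \mathcal{A}\mu = a$, $\Sigma$ unknown versus $H_1$: no constraints.


From Corollary 5.4 and under $H_0$ it follows immediately that

$$(n - 1)(\mathcal{A}\bar{x} - a)^\top (\mathcal{A}\mathcal{S}\mathcal{A}^\top)^{-1} (\mathcal{A}\bar{x} - a) \sim T^2_{q,n-1} \tag{7.9}$$

since indeed under $H_0$,

$$\mathcal{A}\bar{x} \sim N_q(a, n^{-1}\mathcal{A}\Sigma\mathcal{A}^\top)$$

is independent of

$$n\mathcal{A}\mathcal{S}\mathcal{A}^\top \sim W_q(\mathcal{A}\Sigma\mathcal{A}^\top, n - 1).$$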

Example 7.9 Let us come back again to the bank data set and suppose that we want to test if $\mu_4 = \mu_5$, i.e. the hypothesis that the lower border mean equals the upper border mean for the forged bills. In this case:

$$\mathcal{A} = (0 \;\; 0 \;\; 0 \;\; 1 \;\; -1 \;\; 0), \qquad a = 0.$$

The test statistic is:

$$99(\mathcal{A}\bar{x})^\top (\mathcal{A}\mathcal{S}_f\mathcal{A}^\top)^{-1} (\mathcal{A}\bar{x}) \sim T^2_{1,99} = F_{1,99}.$$

The observed value is 13.638, which is significant at the 5 % level.
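
A minimal sketch of the statistic (7.9) follows, in Python with NumPy (an assumption of this illustration). The bank notes are not loaded here; the data are synthetic, and the function name `hotelling_linear` is hypothetical. The covariance is computed with divisor $n$, which is the convention under which the factor $(n-1)$ in (7.9) applies. For a single restriction ($q = 1$) the statistic should coincide with the squared paired $t$ statistic of $X_4 - X_5$, which the last lines check.

```python
import numpy as np

def hotelling_linear(X, A, a):
    """(n - 1)(A xbar - a)^T (A S A^T)^{-1} (A xbar - a), with S using divisor n."""
    X = np.asarray(X, dtype=float)
    A = np.atleast_2d(np.asarray(A, dtype=float))
    n = X.shape[0]
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)    # bias=True divides by n
    d = A @ xbar - np.atleast_1d(np.asarray(a, dtype=float))
    return float((n - 1) * d @ np.linalg.solve(A @ S @ A.T, d))

rng = np.random.default_rng(0)                # synthetic data, not the bank notes
X = rng.normal(size=(100, 6))
A = np.array([[0, 0, 0, 1, -1, 0]])           # H0: mu_4 = mu_5
t2 = hotelling_linear(X, A, a=[0.0])

# For q = 1 the statistic equals the squared paired t statistic of X4 - X5.
diff = X[:, 3] - X[:, 4]
t_paired = diff.mean() / (diff.std(ddof=1) / np.sqrt(len(diff)))
print(t2, t_paired ** 2)
```

The resulting value would be compared with a quantile of $F_{1,n-1}$, as in Example 7.9.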


Repeated  Measurements 

In many situations, $n$ independent sampling units are observed at $p$ different times or under $p$ different experimental conditions (different treatments, etc.). So here we repeat $p$ one-dimensional measurements on $n$ different subjects. For instance, we observe the results from $n$ students taking $p$ different exams. We end up with a $(n \times p)$ matrix. We can thus consider the situation where we have $X_1, \dots, X_n$ i.i.d. from a normal distribution $N_p(\mu, \Sigma)$ when there are $p$ repeated measurements. The hypothesis of interest in this case is that there are no treatment effects, $H_0: \mu_1 = \mu_2 = \dots = \mu_p$. This hypothesis is a direct application of Test Problem 6. Indeed, introducing an appropriate matrix transform on $\mu$ we have

$$H_0: \mathcal{C}\mu = 0 \quad \text{where} \quad \mathcal{C}((p - 1) \times p) = \begin{pmatrix} 1 & -1 & 0 & \cdots & 0 \\ 0 & 1 & -1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & 1 & -1 \end{pmatrix}. \tag{7.10}$$


Note  that  in  many  cases  one  of  the  experimental  conditions  is  the  “control”  (a 
placebo,  standard  drug  or  reference  condition).  Suppose  it  is  the  first  component. 
In  that  case  one  is  interested  in  studying  differences  to  the  control  variable.  The 
matrix  C  has  therefore  a  different  form 


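$$\mathcal{C}((p - 1) \times p) = \begin{pmatrix} 1 & -1 & 0 & \cdots & 0 \\ 1 & 0 & -1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 0 & 0 & \cdots & -1 \end{pmatrix}.$$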


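By (7.9) the null hypothesis will be rejected if:

$$\frac{n - p + 1}{p - 1}\, \bar{x}^\top \mathcal{C}^\top (\mathcal{C}\mathcal{S}\mathcal{C}^\top)^{-1} \mathcal{C}\bar{x} > F_{1-\alpha;p-1,n-p+1}.$$

As a matter of fact, $\mathcal{C}\mu$ is the mean of the random variable $y_i = \mathcal{C}x_i$:

$$y_i \sim N_{p-1}(\mathcal{C}\mu, \mathcal{C}\Sigma\mathcal{C}^\top).$$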

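Simultaneous confidence intervals for linear combinations of the mean of $y_i$ have been derived above in (7.7). For all $a \in \mathbb{R}^{p-1}$, with probability $(1 - \alpha)$ we have

$$a^\top \mathcal{C}\mu \in a^\top \mathcal{C}\bar{x} \pm \sqrt{\frac{p - 1}{n - p + 1}\, F_{1-\alpha;p-1,n-p+1}\; a^\top \mathcal{C}\mathcal{S}\mathcal{C}^\top a}.$$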


Due to the nature of the problem here, the row sums of the elements in $\mathcal{C}$ are zero: $\mathcal{C}1_p = 0$; therefore $a^\top \mathcal{C}$ is a vector whose elements sum to 0. This is called a contrast. Let $b = \mathcal{C}^\top a$. We have $b^\top 1_p = \sum_{j=1}^p b_j = 0$. The result above thus provides, for all contrasts $b^\top \mu$ of $\mu$, simultaneous confidence intervals at level $(1 - \alpha)$:

$$b^\top \mu \in b^\top \bar{x} \pm \sqrt{\frac{p - 1}{n - p + 1}\, F_{1-\alpha;p-1,n-p+1}\; b^\top \mathcal{S} b}.$$

Examples of contrasts for $p = 4$ are $b^\top = (1 \;\, -1 \;\, 0 \;\, 0)$ or $(1 \;\, 0 \;\, 0 \;\, -1)$ or even $(1 \;\, -\frac{1}{3} \;\, -\frac{1}{3} \;\, -\frac{1}{3})$ when the control is to be compared with the mean of three different treatments.
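
The rejection rule and the contrast intervals can be sketched as follows in Python with NumPy (an assumption of this illustration; the function names are hypothetical). The inputs are the summary statistics reported in Example 7.10 ($n = 40$, $p = 4$), and the critical value 2.86 is an approximate table value for $F_{0.95;3,37}$ supplied by hand rather than computed.

```python
import numpy as np

def successive_difference_matrix(p):
    """(p-1) x p matrix C of (7.10): row j encodes mu_j - mu_{j+1}."""
    C = np.zeros((p - 1, p))
    idx = np.arange(p - 1)
    C[idx, idx] = 1.0
    C[idx, idx + 1] = -1.0
    return C

def repeated_measures_F(xbar, S, n, C):
    """(n-p+1)/(p-1) * xbar^T C^T (C S C^T)^{-1} C xbar, compared with F_{p-1, n-p+1}."""
    p = len(xbar)
    d = C @ xbar
    return float((n - p + 1) / (p - 1) * d @ np.linalg.solve(C @ S @ C.T, d))

def contrast_interval(b, xbar, S, n, f_crit):
    """Simultaneous interval for b^T mu (b a contrast); f_crit = F_{1-alpha; p-1, n-p+1}."""
    p = len(xbar)
    half = np.sqrt((p - 1) / (n - p + 1) * f_crit * (b @ S @ b))
    centre = b @ xbar
    return float(centre - half), float(centre + half)

# Summary statistics as reported in Example 7.10.
xbar = np.array([1.086, 2.544, 2.851, 3.420])
S = np.array([[2.902, 2.438, 2.963, 2.183],
              [2.438, 3.049, 2.775, 2.319],
              [2.963, 2.775, 4.281, 2.939],
              [2.183, 2.319, 2.939, 3.162]])
C = successive_difference_matrix(4)
F_obs = repeated_measures_F(xbar, S, n=40, C=C)
# 2.86 approximates F_{0.95; 3, 37}; it is a hand-entered assumption of this sketch.
lo, hi = contrast_interval(C[0], xbar, S, n=40, f_crit=2.86)
print(round(F_obs, 2), round(lo, 3), round(hi, 3))
```

Because the inputs are rounded to three decimals, the computed statistic should agree with the value quoted in Example 7.10 only up to rounding.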

Example 7.10 Bock (1975) considers the evolution of the vocabulary of children from the eighth through eleventh grade. The data set contains the scores of a vocabulary test of 40 randomly chosen children. This is a repeated measurement situation ($n = 40$, $p = 4$), since the same children were observed from grades 8 to 11. The statistics of interest are:

$$\bar{x} = (1.086, 2.544, 2.851, 3.420)^\top$$

$$\mathcal{S} = \begin{pmatrix} 2.902 & 2.438 & 2.963 & 2.183 \\ 2.438 & 3.049 & 2.775 & 2.319 \\ 2.963 & 2.775 & 4.281 & 2.939 \\ 2.183 & 2.319 & 2.939 & 3.162 \end{pmatrix}.$$

Suppose we are interested in the yearly evolution of the children. Then the matrix $\mathcal{C}$ providing successive differences of $\mu_j$ is:

$$\mathcal{C} = \begin{pmatrix} 1 & -1 & 0 & 0 \\ 0 & 1 & -1 & 0 \\ 0 & 0 & 1 & -1 \end{pmatrix}.$$

The value of the test statistic is $F_{obs} = 53.134$, which is highly significant for $F_{3,37}$. There are significant differences between the successive means. However, the analysis of the contrasts shows the following simultaneous 95 % confidence intervals:

$$-1.958 \le \mu_1 - \mu_2 \le -0.959$$
$$-0.949 \le \mu_2 - \mu_3 \le 0.335$$
$$-1.171 \le \mu_3 - \mu_4 \le 0.036.$$

Thus, the rejection of $H_0$ is mainly due to the difference between the children's performances in the first and second year. The confidence intervals for the following contrasts may also be of interest:

$$-2.283 \le \mu_1 - \tfrac{1}{3}(\mu_2 + \mu_3 + \mu_4) \le -1.423$$
$$-1.777 \le \tfrac{1}{3}(\mu_1 + \mu_2 + \mu_3) - \mu_4 \le -0.742$$
$$-1.479 \le \mu_2 - \mu_4 \le -0.272.$$

They show that $\mu_1$ is different from the average of the 3 other years (the same being true for $\mu_4$) and $\mu_4$ turns out to be higher than $\mu_2$ (and of course higher than $\mu_1$).

Test Problem 7 illustrates how the likelihood ratio can be applied to testing a linear restriction on the coefficient $\beta$ of a linear model. It is also shown how a transformation of the test statistic leads to an exact $F$ test as presented in Chap. 3.


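Test Problem 7. Suppose $Y_1, \dots, Y_n$ are independent with $Y_i \sim N_1(\beta^\top x_i, \sigma^2)$, and $x_i \in \mathbb{R}^p$.

$H_0: \mathcal{A}\beta = a$, $\sigma^2$ unknown versus $H_1$: no constraints.


To get the constrained maximum likelihood estimators under $H_0$, let $f(\beta, \lambda) = (y - \mathcal{X}\beta)^\top (y - \mathcal{X}\beta) - \lambda^\top (\mathcal{A}\beta - a)$ where $\lambda \in \mathbb{R}^q$ and solve $\frac{\partial f(\beta,\lambda)}{\partial \beta} = 0$ and $\frac{\partial f(\beta,\lambda)}{\partial \lambda} = 0$ (Exercise 3.24). We thus obtain:

$$\tilde{\beta} = \hat{\beta} - (\mathcal{X}^\top \mathcal{X})^{-1} \mathcal{A}^\top \{\mathcal{A}(\mathcal{X}^\top \mathcal{X})^{-1} \mathcal{A}^\top\}^{-1} (\mathcal{A}\hat{\beta} - a)$$

for $\beta$ and $\tilde{\sigma}^2 = \frac{1}{n}(y - \mathcal{X}\tilde{\beta})^\top (y - \mathcal{X}\tilde{\beta})$. The estimate $\hat{\beta}$ denotes the unconstrained MLE as before. Hence, the LR statistic is

$$-2 \log \lambda = 2(\ell_1^* - \ell_0^*) = n \log\left(\frac{\|y - \mathcal{X}\tilde{\beta}\|^2}{\|y - \mathcal{X}\hat{\beta}\|^2}\right) \xrightarrow{\mathcal{L}} \chi^2_q,$$

where $q$ is the number of elements of $a$. This problem also has an exact $F$-test since

$$\frac{n - p}{q}\left(\frac{\|y - \mathcal{X}\tilde{\beta}\|^2}{\|y - \mathcal{X}\hat{\beta}\|^2} - 1\right) = \frac{n - p}{q}\; \frac{(\mathcal{A}\hat{\beta} - a)^\top \{\mathcal{A}(\mathcal{X}^\top \mathcal{X})^{-1} \mathcal{A}^\top\}^{-1} (\mathcal{A}\hat{\beta} - a)}{(y - \mathcal{X}\hat{\beta})^\top (y - \mathcal{X}\hat{\beta})} \sim F_{q,n-p}.$$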
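
The constrained estimator and the two test statistics of Test Problem 7 can be sketched as below, in Python with NumPy (an assumption of this illustration). The data are synthetic and the function name `linear_restriction_test` is hypothetical. The sketch computes the exact $F$ statistic in both forms given above so that their agreement can be checked numerically.

```python
import numpy as np

def linear_restriction_test(X, y, A, a):
    """Constrained and unconstrained least squares, LR statistic and exact F statistic."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    A = np.atleast_2d(np.asarray(A, float))
    a = np.atleast_1d(np.asarray(a, float))
    n, p = X.shape
    q = A.shape[0]
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y                        # unconstrained MLE
    gap = A @ beta_hat - a
    M = A @ XtX_inv @ A.T
    beta_tilde = beta_hat - XtX_inv @ A.T @ np.linalg.solve(M, gap)  # constrained MLE
    rss1 = np.sum((y - X @ beta_hat) ** 2)
    rss0 = np.sum((y - X @ beta_tilde) ** 2)
    lr = n * np.log(rss0 / rss1)                        # approx. chi^2_q under H0
    F = (n - p) / q * (rss0 / rss1 - 1.0)               # F_{q, n-p} under H0
    F_alt = (n - p) / q * (gap @ np.linalg.solve(M, gap)) / rss1
    return beta_tilde, lr, F, F_alt

rng = np.random.default_rng(1)                          # synthetic data for illustration
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)
A = np.array([[0.0, 1.0, 1.0]])                         # H0: beta_1 + beta_2 = 0
beta_tilde, lr, F, F_alt = linear_restriction_test(X, y, A, a=[0.0])
print(A @ beta_tilde, lr, F, F_alt)
```

The constrained estimate satisfies the restriction by construction, and the two $F$ expressions coincide because the increase in the residual sum of squares equals the quadratic form in $\mathcal{A}\hat{\beta} - a$.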


Example 7.11 Let us continue with the “classic blue” pullovers. We can once more test if $\beta = 0$ in the regression of sales on prices. It holds that

$$\beta = 0 \quad \text{iff} \quad (0 \;\; 1)\begin{pmatrix}\alpha \\ \beta\end{pmatrix} = 0.$$

The LR statistic here is

$$-2 \log \lambda = 0.284,$$

which is not significant for the $\chi^2_1$ distribution. The $F$-test statistic

$$F = 0.231$$

is also not significant. Hence, we can assume independence of sales and prices (alone). Recall that this conclusion has to be revised if we consider the prices together with advertising costs and sales manager hours.

Recall the different conclusion that was made in Example 7.6 when we rejected $H_0: \alpha = 211$ and $\beta = 0$. The rejection there came from the fact that the pair of values was rejected. Indeed, if $\beta = 0$ the estimator of $\alpha$ would be $\bar{y} = 172.70$ and this is too far from 211.

Example 7.12 Let us now consider the multivariate regression in the “classic blue” pullovers example. From Example 3.15 we know that the estimated parameters in the model

$$X_1 = \alpha + \beta_1 X_2 + \beta_2 X_3 + \beta_3 X_4 + \varepsilon$$

are

$$\hat{\alpha} = 65.670, \quad \hat{\beta}_1 = -0.216, \quad \hat{\beta}_2 = 0.485, \quad \hat{\beta}_3 = 0.844.$$

Hence, we could postulate the approximate relation:

$$\beta_1 \approx -\tfrac{1}{2}\beta_2,$$

which means in practice that augmenting the price by 20 EUR requires the advertising costs to increase by 10 EUR in order to keep the number of pullovers sold constant. Vice versa, reducing the price by 20 EUR yields the same result as before if we reduce the advertising costs by 10 EUR. Let us now test whether the hypothesis

$$H_0: \beta_1 = -\tfrac{1}{2}\beta_2$$

is valid. This is equivalent to

$$\left(0 \;\; 1 \;\; \tfrac{1}{2} \;\; 0\right)\begin{pmatrix}\alpha \\ \beta_1 \\ \beta_2 \\ \beta_3\end{pmatrix} = 0.$$

The LR statistic in this case is equal to (Q MVAlrtest)

$$-2 \log \lambda = 0.012,$$

the $F$ statistic is

$$F = 0.007.$$

Hence, in both cases we will not reject the null hypothesis.
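
The two statistics reported in Examples 7.11 and 7.12 are linked: combining the LR statistic and the exact $F$ statistic of Test Problem 7 gives $-2\log\lambda = n\log\{1 + qF/(n-p)\}$. The short Python sketch below evaluates this relation. The sample size $n = 10$ is an assumption here (it is the size of the pullover data set, which is not restated in this section), and the helper name `lr_from_F` is hypothetical.

```python
import math

def lr_from_F(F, n, p, q):
    """-2 log(lambda) = n * log(1 + q F / (n - p)), from the two forms in Test Problem 7."""
    return n * math.log(1.0 + q * F / (n - p))

# Example 7.11: p = 2 coefficients (intercept and slope), q = 1 restriction.
lr_711 = lr_from_F(0.231, n=10, p=2, q=1)
# Example 7.12: p = 4 coefficients (intercept and three slopes), q = 1 restriction.
lr_712 = lr_from_F(0.007, n=10, p=4, q=1)
print(lr_711, lr_712)
```

Since the reported $F$ values are rounded, the results should match the reported LR statistics only up to rounding in the third decimal.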


Comparison  of  Two  Mean  Vectors 

In many situations, we want to compare two groups of individuals for whom a set of $p$ characteristics has been observed. We have two random samples $\{x_{i1}\}_{i=1}^{n_1}$ and $\{x_{j2}\}_{j=1}^{n_2}$ from two distinct $p$-variate normal populations. Several testing issues can be addressed in this framework. In Test Problem 8 we will first test the hypothesis of equal mean vectors in the two groups under the assumption of equality of the two covariance matrices. This task can be solved by adapting Test Problem 2.

In  Test  Problem  9  a  procedure  for  testing  the  equality  of  the  two  covariance 
matrices  is  presented.  If  the  covariance  matrices  differ,  the  procedure  of  Test 
Problem  8  is  no  longer  valid.  If  the  equality  of  the  covariance  matrices  is  rejected,  an 
easy  rule  for  comparing  two  means  with  no  restrictions  on  the  covariance  matrices 
is  provided  in  Test  Problem  10. 


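Test Problem 8. Assume that $X_{i1} \sim N_p(\mu_1, \Sigma)$, with $i = 1, \dots, n_1$ and $X_{j2} \sim N_p(\mu_2, \Sigma)$, with $j = 1, \dots, n_2$, where all the variables are independent.

$H_0: \mu_1 = \mu_2$, versus $H_1$: no constraints.


Both samples provide the statistics $\bar{x}_k$ and $\mathcal{S}_k$, $k = 1, 2$. Let $\delta = \mu_1 - \mu_2$. We have

$$(\bar{x}_1 - \bar{x}_2) \sim N_p\left(\delta, \frac{n_1 + n_2}{n_1 n_2}\Sigma\right) \tag{7.11}$$

$$n_1\mathcal{S}_1 + n_2\mathcal{S}_2 \sim W_p(\Sigma, n_1 + n_2 - 2). \tag{7.12}$$

Let $\mathcal{S} = (n_1 + n_2)^{-1}(n_1\mathcal{S}_1 + n_2\mathcal{S}_2)$ be the weighted mean of $\mathcal{S}_1$ and $\mathcal{S}_2$. Since the two samples are independent and since $\mathcal{S}_k$ is independent of $\bar{x}_k$ (for $k = 1, 2$) it follows that $\mathcal{S}$ is independent of $(\bar{x}_1 - \bar{x}_2)$. Hence, Theorem 5.8 applies and leads to a $T^2$-distribution:

$$\frac{n_1 n_2 (n_1 + n_2 - 2)}{(n_1 + n_2)^2}\, \{(\bar{x}_1 - \bar{x}_2) - \delta\}^\top \mathcal{S}^{-1} \{(\bar{x}_1 - \bar{x}_2) - \delta\} \sim T^2_{p,n_1+n_2-2} \tag{7.13}$$

or

$$\{(\bar{x}_1 - \bar{x}_2) - \delta\}^\top \mathcal{S}^{-1} \{(\bar{x}_1 - \bar{x}_2) - \delta\} \sim \frac{p(n_1 + n_2)^2}{(n_1 + n_2 - p - 1)\, n_1 n_2}\, F_{p,n_1+n_2-p-1}.$$

This result, as in Test Problem 2, can be used to test $H_0: \delta = 0$ or to construct a confidence region for $\delta \in \mathbb{R}^p$. The rejection region is given by:

$$\frac{n_1 n_2 (n_1 + n_2 - p - 1)}{p(n_1 + n_2)^2}\, (\bar{x}_1 - \bar{x}_2)^\top \mathcal{S}^{-1} (\bar{x}_1 - \bar{x}_2) \ge F_{1-\alpha;p,n_1+n_2-p-1}. \tag{7.14}$$

A $(1 - \alpha)$ confidence region for $\delta$ is given by the ellipsoid centred at $(\bar{x}_1 - \bar{x}_2)$

$$\{\delta - (\bar{x}_1 - \bar{x}_2)\}^\top \mathcal{S}^{-1} \{\delta - (\bar{x}_1 - \bar{x}_2)\} \le \frac{p(n_1 + n_2)^2}{(n_1 + n_2 - p - 1)(n_1 n_2)}\, F_{1-\alpha;p,n_1+n_2-p-1},$$

and the simultaneous confidence intervals for all linear combinations $a^\top \delta$ of the elements of $\delta$ are given by

$$a^\top \delta \in a^\top(\bar{x}_1 - \bar{x}_2) \pm \sqrt{\frac{p(n_1 + n_2)^2}{(n_1 + n_2 - p - 1)(n_1 n_2)}\, F_{1-\alpha;p,n_1+n_2-p-1}\; a^\top \mathcal{S} a}.$$

In particular we have at the $(1 - \alpha)$ level, for $j = 1, \dots, p$,

$$\delta_j \in (\bar{x}_{1j} - \bar{x}_{2j}) \pm \sqrt{\frac{p(n_1 + n_2)^2}{(n_1 + n_2 - p - 1)(n_1 n_2)}\, F_{1-\alpha;p,n_1+n_2-p-1}\; s_{jj}}. \tag{7.15}$$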


Example 7.13 Let us come back to the questions raised in Example 7.5. We compare the means of assets ($X_1$) and of sales ($X_2$) for two sectors, energy (group 1) and manufacturing (group 2). With $n_1 = 15$, $n_2 = 10$, and $p = 2$ we obtain the statistics:

$$\bar{x}_1 = \begin{pmatrix} 4084.0 \\ 2580.5 \end{pmatrix}, \qquad \bar{x}_2 = \begin{pmatrix} 4307.2 \\ 4925.2 \end{pmatrix}$$

and

$$\mathcal{S}_1 = 10^7 \begin{pmatrix} 1.6635 & 1.2410 \\ 1.2410 & 1.3747 \end{pmatrix}, \qquad \mathcal{S}_2 = 10^7 \begin{pmatrix} 1.2248 & 1.1425 \\ 1.1425 & 1.5112 \end{pmatrix},$$

so that

$$\mathcal{S} = 10^7 \begin{pmatrix} 1.4880 & 1.2016 \\ 1.2016 & 1.4293 \end{pmatrix}.$$

The observed value of the test statistic (7.14) is $F = 2.7036$. Since $F_{0.95;2,22} = 3.4434$ the hypothesis of equal means of the two groups is not rejected, although it would be rejected at a less severe level ($F > F_{0.90;2,22} = 2.5613$). By directly applying (7.15), the 95 % simultaneous confidence intervals for the differences (Q MVAsimcidif) are obtained as:

$$-4628.6 \le \mu_{1a} - \mu_{2a} \le 4182.2$$
$$-6662.4 \le \mu_{1s} - \mu_{2s} \le 1973.0.$$
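
As a sketch of how (7.14) and (7.15) are evaluated, the Python code below (NumPy is an assumption of this illustration; the function names are hypothetical) recomputes the pooled matrix, the test statistic and the simultaneous intervals from the summary statistics of Example 7.13. The critical value 3.4434 is the one quoted in the example and is entered by hand.

```python
import numpy as np

def two_sample_F(xbar1, xbar2, S1, S2, n1, n2):
    """Statistic of (7.14) and the pooled matrix S = (n1 S1 + n2 S2)/(n1 + n2)."""
    p = len(xbar1)
    S = (n1 * S1 + n2 * S2) / (n1 + n2)
    d = xbar1 - xbar2
    factor = n1 * n2 * (n1 + n2 - p - 1) / (p * (n1 + n2) ** 2)
    return float(factor * (d @ np.linalg.solve(S, d))), S

def simultaneous_intervals(xbar1, xbar2, S, n1, n2, f_crit):
    """Intervals (7.15) for each delta_j; f_crit = F_{1-alpha; p, n1+n2-p-1}."""
    p = len(xbar1)
    scale = p * (n1 + n2) ** 2 / ((n1 + n2 - p - 1) * n1 * n2)
    half = np.sqrt(scale * f_crit * np.diag(S))
    d = xbar1 - xbar2
    return np.column_stack([d - half, d + half])

# Summary statistics of Example 7.13 (energy vs. manufacturing).
xbar1 = np.array([4084.0, 2580.5])
xbar2 = np.array([4307.2, 4925.2])
S1 = 1e7 * np.array([[1.6635, 1.2410], [1.2410, 1.3747]])
S2 = 1e7 * np.array([[1.2248, 1.1425], [1.1425, 1.5112]])
F_obs, S = two_sample_F(xbar1, xbar2, S1, S2, n1=15, n2=10)
ci = simultaneous_intervals(xbar1, xbar2, S, 15, 10, f_crit=3.4434)  # value quoted in the text
print(F_obs)
print(ci)
```

Both intervals contain zero, which is in line with the non-rejection at the 5 % level reported in the example.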

Example 7.14 In order to illustrate the presented test procedures it is interesting to analyse some simulated data. This simulation will point out the importance of the covariances in testing means. We created two independent normal samples in $\mathbb{R}^4$ of sizes $n_1 = 30$ and $n_2 = 20$ with:

$$\mu_1 = (8, 6, 10, 10)^\top$$
$$\mu_2 = (6, 6, 10, 13)^\top.$$

One may consider this as an example of $X = (X_1, \dots, X_4)^\top$ being the students' scores from four tests, where the two groups of students were subjected to two different methods of teaching. First we simulate the two samples with $\Sigma = \mathcal{I}_4$ and obtain the statistics:

$$\bar{x}_1 = (7.607, 5.945, 10.213, 9.635)^\top$$
$$\bar{x}_2 = (6.222, 6.444, 9.560, 13.041)^\top$$

$$\mathcal{S}_1 = \begin{pmatrix} 0.812 & -0.229 & -0.034 & 0.073 \\ -0.229 & 1.001 & 0.010 & -0.059 \\ -0.034 & 0.010 & 1.078 & -0.098 \\ 0.073 & -0.059 & -0.098 & 0.823 \end{pmatrix}$$

$$\mathcal{S}_2 = \begin{pmatrix} 0.559 & -0.057 & -0.271 & 0.306 \\ -0.057 & 1.237 & 0.181 & 0.021 \\ -0.271 & 0.181 & 1.159 & -0.130 \\ 0.306 & 0.021 & -0.130 & 0.683 \end{pmatrix}.$$

The test statistic (7.14) takes the value $F = 60.65$, which is highly significant: the small variance allows the difference to be detected even with these relatively moderate sample sizes. We conclude (at the 95 % level) that:

$$0.6213 \le \delta_1 \le 2.2691$$
$$-1.5217 \le \delta_2 \le 0.5241$$
$$-0.3766 \le \delta_3 \le 1.6830$$
$$-4.2614 \le \delta_4 \le -2.5494$$

which confirms that the means for $X_1$ and $X_4$ are different.

Consider now a different simulation scenario where the standard deviations are four times larger: $\Sigma = 16\,\mathcal{I}_4$. Here we obtain:

$$\bar{x}_1 = (7.312, 6.304, 10.840, 10.902)^\top$$
$$\bar{x}_2 = (6.353, 5.890, 8.604, 11.283)^\top$$

$$\mathcal{S}_1 = \begin{pmatrix} 21.907 & 1.415 & -2.050 & 2.379 \\ 1.415 & 11.853 & 2.104 & -1.864 \\ -2.050 & 2.104 & 17.230 & 0.905 \\ 2.379 & -1.864 & 0.905 & 9.037 \end{pmatrix}$$

$$\mathcal{S}_2 = \begin{pmatrix} 20.349 & -9.463 & 0.958 & -6.507 \\ -9.463 & 15.502 & -3.383 & -2.551 \\ 0.958 & -3.383 & 14.470 & -0.323 \\ -6.507 & -2.551 & -0.323 & 10.311 \end{pmatrix}.$$

Now the test statistic takes the value 1.54, which is no longer significant ($F_{0.95;4,45} = 2.58$). Now we cannot reject the null hypothesis (which we know to be false!) since the increase in variances prohibits the detection of differences of such magnitude.

The following situation illustrates once more the role of the covariances between covariates. Suppose that $\Sigma = 16\,\mathcal{I}_4$ as above but with $\sigma_{14} = \sigma_{41} = -3.999$ (this corresponds to a negative correlation $r_{14} = -0.9997$). We have:

$$\bar{x}_1 = (8.484, 5.908, 9.024, 10.459)^\top$$
$$\bar{x}_2 = (4.959, 7.307, 9.057, 13.803)^\top$$

$$\mathcal{S}_1 = \begin{pmatrix} 14.649 & -0.024 & 1.248 & -3.961 \\ -0.024 & 15.825 & 0.746 & 4.301 \\ 1.248 & 0.746 & 9.446 & 1.241 \\ -3.961 & 4.301 & 1.241 & 20.002 \end{pmatrix}$$

$$\mathcal{S}_2 = \begin{pmatrix} 14.035 & -2.372 & 5.596 & -1.601 \\ -2.372 & 9.173 & -2.027 & -2.954 \\ 5.596 & -2.027 & 9.021 & -1.301 \\ -1.601 & -2.954 & -1.301 & 9.593 \end{pmatrix}.$$

The value of $F$ is 3.853, which is significant at the 5 % level ($p$-value = 0.0089). So the null hypothesis $\delta = \mu_1 - \mu_2 = 0$ is outside the 95 % confidence ellipsoid. However, the simultaneous confidence intervals, which do not take the covariances into account, are given by:

$$-0.1837 \le \delta_1 \le 7.2343$$
$$-4.9452 \le \delta_2 \le 2.1466$$
$$-3.0091 \le \delta_3 \le 2.9438$$
$$-7.2336 \le \delta_4 \le 0.5450.$$

They contain the null value (see Remark 7.1 above) although they are very asymmetric for $\delta_1$ and $\delta_4$.
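
The loss of power described in Example 7.14 can be explored with a small simulation. The sketch below is in Python with NumPy (an assumption of this illustration); the seed is arbitrary, so the book's draws and reported values are not reproduced, and the function name is hypothetical. The same standard normal noise is reused in both scenarios and only its scale changes, which isolates the effect of the variance.

```python
import numpy as np

def two_sample_F_from_data(X1, X2):
    """Statistic (7.14) computed from raw samples (covariances use divisor n_k)."""
    n1, p = X1.shape
    n2 = X2.shape[0]
    S = (n1 * np.cov(X1, rowvar=False, bias=True)
         + n2 * np.cov(X2, rowvar=False, bias=True)) / (n1 + n2)
    d = X1.mean(axis=0) - X2.mean(axis=0)
    factor = n1 * n2 * (n1 + n2 - p - 1) / (p * (n1 + n2) ** 2)
    return float(factor * (d @ np.linalg.solve(S, d)))

rng = np.random.default_rng(42)             # arbitrary seed, not the book's simulation
mu1 = np.array([8.0, 6.0, 10.0, 10.0])
mu2 = np.array([6.0, 6.0, 10.0, 13.0])
Z1 = rng.normal(size=(30, 4))               # standard normal noise, reused below
Z2 = rng.normal(size=(20, 4))

F_small = two_sample_F_from_data(mu1 + Z1, mu2 + Z2)          # Sigma = I_4
F_large = two_sample_F_from_data(mu1 + 4 * Z1, mu2 + 4 * Z2)  # Sigma = 16 I_4
print(F_small, F_large)                     # compare both with F_{0.95;4,45} = 2.58
```

With the larger variance the same mean difference yields a much smaller statistic, which is the mechanism behind the non-rejection in the second scenario of the example.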

Example 7.15 Let us compare the vectors of means of the forged and the genuine bank notes. The matrices $\mathcal{S}_f$ and $\mathcal{S}_g$ were given in Example 3.1 and since here $n_f = n_g = 100$, $\mathcal{S}$ is the simple average of $\mathcal{S}_f$ and $\mathcal{S}_g$: $\mathcal{S} = \frac{1}{2}(\mathcal{S}_f + \mathcal{S}_g)$.

$$\bar{x}_g = (214.97, 129.94, 129.72, 8.305, 10.168, 141.52)^\top$$
$$\bar{x}_f = (214.82, 130.3, 130.19, 10.53, 11.133, 139.45)^\top.$$

The test statistic is given by (7.14) and turns out to be $F = 391.92$, which is highly significant for $F_{6,193}$. The 95 % simultaneous confidence intervals for the differences $\delta_j = \mu_{gj} - \mu_{fj}$, $j = 1, \dots, p$ are:

$$-0.0443 \le \delta_1 \le 0.3363$$
$$-0.5186 \le \delta_2 \le -0.1954$$
$$-0.6416 \le \delta_3 \le -0.3044$$
$$-2.6981 \le \delta_4 \le -1.7519$$
$$-1.2952 \le \delta_5 \le -0.6348$$
$$1.8072 \le \delta_6 \le 2.3268.$$

All of the components (except for the first one) show significant differences in the means. The main effects are taken by the lower border ($X_4$) and the diagonal ($X_6$).

The preceding test implicitly uses the fact that the two samples are extracted from two different populations with common variance $\Sigma$. In this case, the test statistic (7.14) measures the distance between the two centers of gravity of the two groups w.r.t. the common metric given by the pooled variance matrix $\mathcal{S}$. If $\Sigma_1 \ne \Sigma_2$ no such matrix exists. There are no satisfactory test procedures for testing the equality of variance matrices which are robust with respect to normality assumptions of the populations. The following test extends Bartlett's test for equality of variances in the univariate case. But this test is known to be very sensitive to departures from normality.


Test Problem 9 (Comparison of Covariance Matrices). Let $X_{ih} \sim N_p(\mu_h, \Sigma_h)$, $i = 1, \dots, n_h$, $h = 1, \dots, k$ be independent random variables,

$H_0: \Sigma_1 = \Sigma_2 = \dots = \Sigma_k$ versus $H_1$: no constraints.


Each sub-sample provides $\mathcal{S}_h$, an estimator of $\Sigma_h$, with

$$n_h\mathcal{S}_h \sim W_p(\Sigma_h, n_h - 1).$$

Under $H_0$, $\sum_{h=1}^k n_h\mathcal{S}_h \sim W_p(\Sigma, n - k)$ (Sect. 5.2), where $\Sigma$ is the common covariance matrix and $n = \sum_{h=1}^k n_h$. Let $\mathcal{S} = \frac{n_1\mathcal{S}_1 + \dots + n_k\mathcal{S}_k}{n}$ be the weighted average of the $\mathcal{S}_h$ (this is in fact the MLE of $\Sigma$ when $H_0$ is true). The LRT leads to the statistic

$$-2 \log \lambda = n \log|\mathcal{S}| - \sum_{h=1}^k n_h \log|\mathcal{S}_h| \tag{7.16}$$

which under $H_0$ is approximately distributed as a $\chi^2_m$ where $m = \frac{1}{2}(k - 1)p(p + 1)$.

Example 7.16 Let us come back to Example 7.13, where the means of assets and sales have been compared for companies from the energy and manufacturing sector assuming that $\Sigma_1 = \Sigma_2$. The test of $\Sigma_1 = \Sigma_2$ leads to the value of the test statistic

$$-2 \log \lambda = 0.9076 \tag{7.17}$$

which is not significant ($p$-value for a $\chi^2_3$ = 0.82). We cannot reject $H_0$ and the comparison of the means performed above is valid.
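
The statistic (7.16) is straightforward to evaluate from the sub-sample covariance matrices. The Python sketch below (NumPy is an assumption of this illustration; the function name `cov_equality_lr` is hypothetical) applies it to the two matrices listed in Example 7.13 with $n_1 = 15$ and $n_2 = 10$.

```python
import numpy as np

def cov_equality_lr(S_list, n_list):
    """-2 log(lambda) of (7.16) and its degrees of freedom m = (k-1) p (p+1) / 2."""
    n = sum(n_list)
    S = sum(nh * Sh for nh, Sh in zip(n_list, S_list)) / n   # weighted average
    logdet = lambda M: np.linalg.slogdet(M)[1]                # log|M|, numerically stable
    lr = n * logdet(S) - sum(nh * logdet(Sh) for nh, Sh in zip(n_list, S_list))
    k, p = len(S_list), S.shape[0]
    return float(lr), (k - 1) * p * (p + 1) // 2

S1 = 1e7 * np.array([[1.6635, 1.2410], [1.2410, 1.3747]])
S2 = 1e7 * np.array([[1.2248, 1.1425], [1.1425, 1.5112]])
lr, m = cov_equality_lr([S1, S2], [15, 10])
print(lr, m)   # lr is compared with a chi^2 quantile with m degrees of freedom
```

Because the published matrices are rounded, the computed value should agree with (7.17) only up to rounding.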

Example 7.17 Let us compare the covariance matrices of the forged and the genuine bank notes (the matrices $\mathcal{S}_f$ and $\mathcal{S}_g$ are shown in Example 3.1). A first look seems to suggest that $\Sigma_1 \ne \Sigma_2$. The pooled variance $\mathcal{S}$ is given by $\mathcal{S} = \frac{1}{2}(\mathcal{S}_f + \mathcal{S}_g)$ since here $n_f = n_g$. The test statistic here is $-2 \log \lambda = 127.21$, which is highly significant for a $\chi^2$ with 21 degrees of freedom. As expected, we reject the hypothesis of equal covariance matrices, and as a result the procedure for comparing the two means in Example 7.15 is not valid.

What can we do with unequal covariance matrices? When both $n_1$ and $n_2$ are large, we have a simple solution:


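Test Problem 10 (Comparison of Two Means, Unequal Covariance Matrices, Large Samples). Assume that $X_{i1} \sim N_p(\mu_1, \Sigma_1)$, with $i = 1, \dots, n_1$ and $X_{j2} \sim N_p(\mu_2, \Sigma_2)$, with $j = 1, \dots, n_2$ are independent random variables.

$H_0: \mu_1 = \mu_2$ versus $H_1$: no constraints.


Letting $\delta = \mu_1 - \mu_2$, we have

$$(\bar{x}_1 - \bar{x}_2) \sim N_p\left(\delta, \frac{\Sigma_1}{n_1} + \frac{\Sigma_2}{n_2}\right).$$

Therefore, by (5.4), under $H_0$

$$(\bar{x}_1 - \bar{x}_2)^\top \left(\frac{\Sigma_1}{n_1} + \frac{\Sigma_2}{n_2}\right)^{-1} (\bar{x}_1 - \bar{x}_2) \sim \chi^2_p.$$

Since $\mathcal{S}_i$ is a consistent estimator of $\Sigma_i$ for $i = 1, 2$, we have

$$(\bar{x}_1 - \bar{x}_2)^\top \left(\frac{\mathcal{S}_1}{n_1} + \frac{\mathcal{S}_2}{n_2}\right)^{-1} (\bar{x}_1 - \bar{x}_2) \xrightarrow{\mathcal{L}} \chi^2_p. \tag{7.18}$$

This can be used in place of (7.13) for testing $H_0$, defining a confidence region for $\delta$ or constructing simultaneous confidence intervals for $\delta_j$, $j = 1, \dots, p$.

For instance, the rejection region at the level $\alpha$ will be

$$(\bar{x}_1 - \bar{x}_2)^\top \left(\frac{\mathcal{S}_1}{n_1} + \frac{\mathcal{S}_2}{n_2}\right)^{-1} (\bar{x}_1 - \bar{x}_2) > \chi^2_{1-\alpha;p} \tag{7.19}$$

and the $(1 - \alpha)$ simultaneous confidence intervals for $\delta_j$, $j = 1, \dots, p$ are:

$$\delta_j \in (\bar{x}_{1j} - \bar{x}_{2j}) \pm \sqrt{\chi^2_{1-\alpha;p}\left(\frac{s_{jj}^{(1)}}{n_1} + \frac{s_{jj}^{(2)}}{n_2}\right)} \tag{7.20}$$

where $s_{jj}^{(i)}$ is the $(j, j)$ element of the matrix $\mathcal{S}_i$. This may be compared to (7.15) where the pooled variance was used.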

Remark 7.2 We see, by comparing the statistics (7.19) with (7.14), that we measure here the distance between $\bar{x}_1$ and $\bar{x}_2$ using the metric $\left(\frac{\mathcal{S}_1}{n_1} + \frac{\mathcal{S}_2}{n_2}\right)$. It should be noted that when $n_1 = n_2$, the two methods are essentially the same since then $\mathcal{S} = \frac{1}{2}(\mathcal{S}_1 + \mathcal{S}_2)$. If the covariances are different but have the same eigenvectors (different eigenvalues), one can apply the common principal component (CPC) technique, see Chap. 11.

Example 7.18 Let us use the last test to compare the forged and the genuine bank notes again ($n_1$ and $n_2$ are both large). The test statistic (7.19) turns out to be 2436.8, which is again highly significant. The 95 % simultaneous confidence intervals are:

$$-0.0389 \le \delta_1 \le 0.3309$$
$$-0.5140 \le \delta_2 \le -0.2000$$
$$-0.6368 \le \delta_3 \le -0.3092$$
$$-2.6846 \le \delta_4 \le -1.7654$$
$$-1.2858 \le \delta_5 \le -0.6442$$
$$1.8146 \le \delta_6 \le 2.3194$$

showing that all the components except the first are different from zero, the largest difference coming from $X_6$ (length of the diagonal) and $X_4$ (lower border). The results are very similar to those obtained in Example 7.15. This is due to the fact that here $n_1 = n_2$, as we already mentioned in the remark above.
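
The closeness of the two approaches for equal sample sizes can be made explicit. With $n_1 = n_2 = n$ the matrix in (7.19) equals $2\mathcal{S}/n$, so the statistic (7.19) is $2np/(2n - p - 1)$ times the statistic (7.14). The Python sketch below (NumPy is an assumption of this illustration; the function names and the small input matrices are hypothetical) checks this identity numerically and then applies the same factor to the value $F = 391.92$ reported in Example 7.15.

```python
import numpy as np

def large_sample_chi2(xbar1, xbar2, S1, S2, n1, n2):
    """Statistic of (7.19): d^T (S1/n1 + S2/n2)^{-1} d, compared with chi^2_{1-alpha; p}."""
    d = xbar1 - xbar2
    return float(d @ np.linalg.solve(S1 / n1 + S2 / n2, d))

def pooled_F(xbar1, xbar2, S1, S2, n1, n2):
    """Statistic of (7.14) with the pooled matrix S = (n1 S1 + n2 S2)/(n1 + n2)."""
    p = len(xbar1)
    d = xbar1 - xbar2
    S = (n1 * S1 + n2 * S2) / (n1 + n2)
    factor = n1 * n2 * (n1 + n2 - p - 1) / (p * (n1 + n2) ** 2)
    return float(factor * (d @ np.linalg.solve(S, d)))

# Hypothetical inputs with equal sample sizes n1 = n2 = n.
n, p = 100, 3
xbar1, xbar2 = np.array([1.0, 2.0, 3.0]), np.array([1.2, 1.5, 3.1])
S1 = np.array([[2.0, 0.3, 0.1], [0.3, 1.0, 0.2], [0.1, 0.2, 1.5]])
S2 = np.array([[1.0, -0.2, 0.0], [-0.2, 2.0, 0.4], [0.0, 0.4, 1.0]])
chi2_stat = large_sample_chi2(xbar1, xbar2, S1, S2, n, n)
F_stat = pooled_F(xbar1, xbar2, S1, S2, n, n)
print(chi2_stat, 2 * n * p / (2 * n - p - 1) * F_stat)   # the two numbers coincide

# Same factor applied to the bank note value F = 391.92 (n = 100, p = 6).
bank_chi2 = 2 * 100 * 6 / (2 * 100 - 6 - 1) * 391.92
print(bank_chi2)
```

The converted bank note value lands close to the statistic quoted in Example 7.18, which is what the remark on equal sample sizes leads one to expect.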


Profile  Analysis 

Another  useful  application  of  Test  Problem  6  is  the  repeated  measurements  problem 
applied  to  two  independent  groups.  This  problem  arises  in  practice  when  we 
observe  repeated  measurements  of  characteristics  (or  measures  of  the  same  type 
under  different  experimental  conditions)  on  the  different  groups  which  have  to  be 
compared.  It  is  important  that  the  p  measures  (the  “profile”)  are  comparable,  and, 
in  particular,  are  reported  in  the  same  units.  For  instance,  they  may  be  measures 
of  blood  pressure  at  p  different  points  in  time,  one  group  being  the  control  group 
and  the  other  the  group  receiving  a  new  treatment.  The  observations  may  be  the 
scores  obtained  from  p  different  tests  of  two  different  experimental  groups.  One  is 
then  interested  in  comparing  the  profiles  of  each  group:  the  profile  being  just  the 
vectors  of  the  means  of  the  p  responses  (the  comparison  may  be  visualised  in  a 
two-dimensional  graph  using  the  parallel  coordinate  plot  introduced  in  Sect.  1.7). 
We  are  thus  in  the  same  statistical  situation  as  for  the  comparison  of  two  means: 

$$X_{i1} \sim N_p(\mu_1, \Sigma), \quad i = 1, \dots, n_1$$
$$X_{j2} \sim N_p(\mu_2, \Sigma), \quad j = 1, \dots, n_2,$$

where all variables are independent. Suppose the two population profiles look like in Fig. 7.1.

The  following  questions  are  of  interest: 

1. Are the profiles similar in the sense of being parallel (which means no interaction between the treatments and the groups)?

2.  If  the  profiles  are  parallel,  are  they  at  the  same  level? 

3.  If  the  profiles  are  parallel,  is  there  any  treatment  effect,  i.e.  are  the  profiles 
horizontal  (profiles  remain  the  same  no  matter  which  treatment  received)? 

The  above  questions  are  easily  translated  into  linear  constraints  on  the  means  and  a 
test  statistic  can  be  obtained  accordingly. 


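Fig. 7.1 Example of population profiles Q MVAprofil


Parallel Profiles

Let $\mathcal{C}$ be a $(p - 1) \times p$ matrix defined as

$$\mathcal{C} = \begin{pmatrix} 1 & -1 & 0 & \cdots & 0 \\ 0 & 1 & -1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & 1 & -1 \end{pmatrix}.$$

The hypothesis to be tested is

$$H_0^{(1)}: \mathcal{C}(\mu_1 - \mu_2) = 0.$$

From (7.11), (7.12) and Corollary 5.4 we know that under $H_0^{(1)}$:

$$\frac{n_1 n_2}{(n_1 + n_2)^2}(n_1 + n_2 - 2)\, \{\mathcal{C}(\bar{x}_1 - \bar{x}_2)\}^\top (\mathcal{C}\mathcal{S}\mathcal{C}^\top)^{-1} \mathcal{C}(\bar{x}_1 - \bar{x}_2) \sim T^2_{p-1,n_1+n_2-2}, \tag{7.21}$$

where $\mathcal{S}$ is the pooled covariance matrix. The hypothesis is rejected if

$$\frac{n_1 n_2 (n_1 + n_2 - p)}{(n_1 + n_2)^2 (p - 1)}\, \{\mathcal{C}(\bar{x}_1 - \bar{x}_2)\}^\top (\mathcal{C}\mathcal{S}\mathcal{C}^\top)^{-1} \mathcal{C}(\bar{x}_1 - \bar{x}_2) > F_{1-\alpha;p-1,n_1+n_2-p}.$$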

Equality  of  Two  Levels 

The question of equality of the two levels is meaningful only if the two profiles are parallel. In the case of interactions (rejection of $H_0^{(1)}$), the two populations react differently to the treatments and the question of the level has no meaning.

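The equality of the two levels can be formalised as

$$H_0^{(2)}: 1_p^\top(\mu_1 - \mu_2) = 0$$

since

$$1_p^\top(\bar{x}_1 - \bar{x}_2) \sim N_1\left(1_p^\top(\mu_1 - \mu_2), \frac{n_1 + n_2}{n_1 n_2}\, 1_p^\top \Sigma 1_p\right)$$

and

$$(n_1 + n_2)\, 1_p^\top \mathcal{S} 1_p \sim W_1(1_p^\top \Sigma 1_p, n_1 + n_2 - 2).$$

Using Corollary 5.4 we have that:

$$\frac{n_1 n_2}{(n_1 + n_2)^2}(n_1 + n_2 - 2)\, \frac{\{1_p^\top(\bar{x}_1 - \bar{x}_2)\}^2}{1_p^\top \mathcal{S} 1_p} \sim T^2_{1,n_1+n_2-2} = F_{1,n_1+n_2-2}. \tag{7.22}$$

The rejection region is

$$\frac{n_1 n_2 (n_1 + n_2 - 2)}{(n_1 + n_2)^2}\, \frac{\{1_p^\top(\bar{x}_1 - \bar{x}_2)\}^2}{1_p^\top \mathcal{S} 1_p} > F_{1-\alpha;1,n_1+n_2-2}.$$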


Treatment  Effect 

If  it  is  rejected  that  the  profiles  are  parallel,  then  two  independent  analyses  should 
be  done  on  the  two  groups  using  the  repeated  measurement  approach.  But  if  it  is 
accepted  that  they  are  parallel,  then  we  can  exploit  the  information  contained  in 
both  groups  (possibly  at  different  levels)  to  test  a  treatment  effect,  i.e.  if  the  two 
profiles  are  horizontal.  This  may  be  written  as: 


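$$H_0^{(3)}: \mathcal{C}(\mu_1 + \mu_2) = 0.$$

Consider the average profile $\bar{x}$:

$$\bar{x} = \frac{n_1\bar{x}_1 + n_2\bar{x}_2}{n_1 + n_2}.$$

Clearly,

$$\bar{x} \sim N_p\left(\frac{n_1\mu_1 + n_2\mu_2}{n_1 + n_2}, \frac{1}{n_1 + n_2}\Sigma\right).$$

Now it is not hard to prove that $H_0^{(3)}$ with $H_0^{(1)}$ implies that

$$\mathcal{C}\left(\frac{n_1\mu_1 + n_2\mu_2}{n_1 + n_2}\right) = 0.$$

So under parallel, horizontal profiles we have

$$\sqrt{n_1 + n_2}\; \mathcal{C}\bar{x} \sim N_{p-1}(0, \mathcal{C}\Sigma\mathcal{C}^\top).$$

From Corollary 5.4 we again obtain

$$(n_1 + n_2 - 2)(\mathcal{C}\bar{x})^\top (\mathcal{C}\mathcal{S}\mathcal{C}^\top)^{-1} \mathcal{C}\bar{x} \sim T^2_{p-1,n_1+n_2-2}.$$

This leads to the rejection region of $H_0^{(3)}$, namely

$$\frac{n_1 + n_2 - p}{p - 1}\, (\mathcal{C}\bar{x})^\top (\mathcal{C}\mathcal{S}\mathcal{C}^\top)^{-1} \mathcal{C}\bar{x} > F_{1-\alpha;p-1,n_1+n_2-p}. \tag{7.23}$$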


Example 7.19 Morrison (1990) proposed a test in which the results of four sub-tests of the Wechsler Adult Intelligence Scale (WAIS) are compared for two categories of people: group 1 contains $n_1 = 37$ people who do not have a senile factor and group 2 contains $n_2 = 12$ people who have a senile factor. The four WAIS sub-tests are $X_1$ (information), $X_2$ (similarities), $X_3$ (arithmetic) and $X_4$ (picture completion). The relevant statistics are

$$\bar{x}_1 = (12.57, 9.57, 11.49, 7.97)^\top$$
$$\bar{x}_2 = (8.75, 5.33, 8.50, 4.75)^\top$$

$$\mathcal{S}_1 = \begin{pmatrix} 11.164 & 8.840 & 6.210 & 2.020 \\ 8.840 & 11.759 & 5.778 & 0.529 \\ 6.210 & 5.778 & 10.790 & 1.743 \\ 2.020 & 0.529 & 1.743 & 3.594 \end{pmatrix}$$

$$\mathcal{S}_2 = \begin{pmatrix} 9.688 & 9.583 & 8.875 & 7.021 \\ 9.583 & 16.722 & 11.083 & 8.167 \\ 8.875 & 11.083 & 12.083 & 4.875 \\ 7.021 & 8.167 & 4.875 & 11.688 \end{pmatrix}.$$

The test statistic for testing if the two profiles are parallel is $F = 0.4634$, which is not significant ($p$-value = 0.71). Thus it is accepted that the two are parallel. The second test statistic (testing the equality of the levels of the two profiles) is $F = 17.21$, which is highly significant ($p$-value $\approx 10^{-4}$). The global level of the test for the non-senile people is superior to the senile group. The final test (testing the horizontality of the average profile) has the test statistic $F = 53.32$, which is also highly significant ($p$-value $\approx 10^{-14}$). This implies that there are substantial differences among the means of the different sub-tests.
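
The three profile tests can be sketched together in Python with NumPy (an assumption of this illustration; the function name `profile_analysis` is hypothetical). The code follows the rejection regions for parallel profiles, equal levels and horizontal profiles given above and uses the summary statistics of Example 7.19 with the pooled matrix $\mathcal{S} = (n_1\mathcal{S}_1 + n_2\mathcal{S}_2)/(n_1 + n_2)$.

```python
import numpy as np

def profile_analysis(xbar1, xbar2, S1, S2, n1, n2):
    """F statistics for parallelism, equal levels and horizontality of two profiles."""
    p = len(xbar1)
    N = n1 + n2
    S = (n1 * S1 + n2 * S2) / N                   # pooled covariance matrix
    C = np.eye(p)[:-1] - np.eye(p)[1:]            # successive-difference matrix
    one = np.ones(p)
    d = xbar1 - xbar2
    M = C @ S @ C.T

    Cd = C @ d
    F_parallel = n1 * n2 * (N - p) / (N ** 2 * (p - 1)) * (Cd @ np.linalg.solve(M, Cd))
    F_level = n1 * n2 * (N - 2) / N ** 2 * (one @ d) ** 2 / (one @ S @ one)
    xbar = (n1 * xbar1 + n2 * xbar2) / N          # average profile
    Cx = C @ xbar
    F_horizontal = (N - p) / (p - 1) * (Cx @ np.linalg.solve(M, Cx))
    return float(F_parallel), float(F_level), float(F_horizontal)

# Summary statistics of Example 7.19 (WAIS sub-tests).
xbar1 = np.array([12.57, 9.57, 11.49, 7.97])
xbar2 = np.array([8.75, 5.33, 8.50, 4.75])
S1 = np.array([[11.164, 8.840, 6.210, 2.020],
               [8.840, 11.759, 5.778, 0.529],
               [6.210, 5.778, 10.790, 1.743],
               [2.020, 0.529, 1.743, 3.594]])
S2 = np.array([[9.688, 9.583, 8.875, 7.021],
               [9.583, 16.722, 11.083, 8.167],
               [8.875, 11.083, 12.083, 4.875],
               [7.021, 8.167, 4.875, 11.688]])
F1, F2, F3 = profile_analysis(xbar1, xbar2, S1, S2, n1=37, n2=12)
print(F1, F2, F3)   # parallelism, equal levels, horizontality
```

The three values are compared with $F_{p-1,n_1+n_2-p}$, $F_{1,n_1+n_2-2}$ and $F_{p-1,n_1+n_2-p}$ respectively; with the rounded inputs they should match the statistics quoted in the example only approximately.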


Summary 

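Hypotheses about $\mu$ can often be written as $\mathcal{A}\mu = a$, with matrix $\mathcal{A}$, and vector $a$.

The hypothesis $H_0: \mathcal{A}\mu = a$ for $X \sim N_p(\mu, \Sigma)$ with $\Sigma$ known leads to $-2 \log \lambda = n(\mathcal{A}\bar{x} - a)^\top (\mathcal{A}\Sigma\mathcal{A}^\top)^{-1} (\mathcal{A}\bar{x} - a) \sim \chi^2_q$, where $q$ is the number of elements in $a$.

The hypothesis $H_0: \mathcal{A}\mu = a$ for $X \sim N_p(\mu, \Sigma)$ with $\Sigma$ unknown leads to $-2 \log \lambda = n \log\{1 + (\mathcal{A}\bar{x} - a)^\top (\mathcal{A}\mathcal{S}\mathcal{A}^\top)^{-1} (\mathcal{A}\bar{x} - a)\} \to \chi^2_q$, where $q$ is the number of elements in $a$, and we have an exact test $(n - 1)(\mathcal{A}\bar{x} - a)^\top (\mathcal{A}\mathcal{S}\mathcal{A}^\top)^{-1} (\mathcal{A}\bar{x} - a) \sim T^2_{q,n-1}$.

The hypothesis $H_0: \mathcal{A}\beta = a$ for $Y_i \sim N_1(\beta^\top x_i, \sigma^2)$ with $\sigma^2$ unknown leads to $-2 \log \lambda = n \log\left(\frac{\|y - \mathcal{X}\tilde{\beta}\|^2}{\|y - \mathcal{X}\hat{\beta}\|^2}\right) \to \chi^2_q$, with $q$ being the length of $a$ and with

$$\frac{n - p}{q}\; \frac{(\mathcal{A}\hat{\beta} - a)^\top \{\mathcal{A}(\mathcal{X}^\top \mathcal{X})^{-1} \mathcal{A}^\top\}^{-1} (\mathcal{A}\hat{\beta} - a)}{(y - \mathcal{X}\hat{\beta})^\top (y - \mathcal{X}\hat{\beta})} \sim F_{q,n-p}.$$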


7.3  Boston  Housing 

Returning to the Boston Housing data set, we are now in a position to test if the means of the variables vary according to their location, for example, when they are located in a district with high valued houses. In Chap. 1, we built two groups of observations according to the value of $X_{14}$ being less than or equal to the median of $X_{14}$ (a group of 256 districts) and greater than the median (a group of 250 districts). In what follows, we use the transformed variables motivated in Sect. 1.9.

Testing  the  equality  of  the  means  from  the  two  groups  was  proposed  in  a 
multivariate  setup,  so  we  restrict  the  analysis  to  the  variables  X\,  Xs,  Xu,  and 
X\3  to  see  if  the  differences  between  the  two  groups  that  were  identified  in  Chap.  1 
can  be  confirmed  by  a  formal  test.  As  in  Test  Problem  8,  the  hypothesis  to  be  tested  is 

Ho  :  /i\  —  /X2,  where  fi\  e  R5,n \  —  256,  and  7^2  =  250. 
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X  is  not  known.  The  F-statistic  given  in  (7.13)  is  equal  to  126.30,  which  is  much 
higher  than  the  critical  value  ^o.95;5,500  —  2.23.  Therefore,  we  reject  the  hypothesis 
of  equal  means. 

To see which component, $X_1$, $X_5$, $X_8$, $X_{11}$, or $X_{13}$, is responsible for this
rejection, take a look at the simultaneous confidence intervals defined in (7.14):

$\delta_1 \in (1.4020, 2.5499)$

$\delta_5 \in (0.1315, 0.2383)$

$\delta_8 \in (-0.5344, -0.2222)$

$\delta_{11} \in (1.0375, 1.7384)$

$\delta_{13} \in (1.1577, 1.5818)$.

These confidence intervals confirm that all of the $\delta_j$ are significantly different from
zero (note there is a negative effect for $X_8$: weighted distances to employment
centers) Q MVAsimcibh.

We could also check if the factor “being bounded by the river” (variable
$X_4$) has some effect on the other variables. To do this, compare the means of
$(X_5, X_8, X_9, X_{12}, X_{13}, X_{14})^{\top}$. There are two groups: $n_1 = 35$ districts bounded
by the river and $n_2 = 471$ districts not bounded by the river. Test Problem 8
($H_0: \mu_1 = \mu_2$) is applied again with $p = 6$. The resulting test statistic, $F = 5.81$,
is highly significant ($F_{0.95;6,499} = 2.12$). The simultaneous confidence intervals
indicate that only $X_{14}$ (the value of the houses) is responsible for the hypothesis
being rejected. At a confidence level of 0.95

$\delta_5 \in (-0.0603, 0.1919)$

$\delta_8 \in (-0.5225, 0.1527)$

$\delta_9 \in (-0.5051, 0.5938)$

$\delta_{12} \in (-0.3974, 0.7481)$

$\delta_{13} \in (-0.8595, 0.3782)$

$\delta_{14} \in (0.0014, 0.5084)$.
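Reading off which components drive a rejection amounts to checking which simultaneous intervals exclude zero. The following Python sketch does this for the two sets of intervals reported in this section; the helper name `excludes_zero` and the dictionary layout are illustrative choices, not part of the quantlets.

```python
# Simultaneous confidence intervals for the mean differences delta_j,
# copied from the two group comparisons reported in this section.
median_split = {
    1: (1.4020, 2.5499), 5: (0.1315, 0.2383), 8: (-0.5344, -0.2222),
    11: (1.0375, 1.7384), 13: (1.1577, 1.5818),
}
river_split = {
    5: (-0.0603, 0.1919), 8: (-0.5225, 0.1527), 9: (-0.5051, 0.5938),
    12: (-0.3974, 0.7481), 13: (-0.8595, 0.3782), 14: (0.0014, 0.5084),
}


def excludes_zero(interval):
    """True if the interval lies entirely on one side of zero."""
    lower, upper = interval
    return lower > 0 or upper < 0


# Indices j whose interval excludes zero, i.e. components that differ.
significant_median = sorted(j for j, ci in median_split.items() if excludes_zero(ci))
significant_river = sorted(j for j, ci in river_split.items() if excludes_zero(ci))

print("median split:", significant_median)
print("river split: ", significant_river)
```

With the reported endpoints, every component is flagged for the median split, while only $X_{14}$ is flagged for the river split.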

Testing  Linear  Restrictions 

In Chap. 3 a linear model was proposed that explained the variations of the price $X_{14}$
by the variations of the other variables. Using the same procedure that was shown
in Test Problem 7, we are in a position to test a set of linear restrictions on the
vector of regression coefficients $\beta$.
The model we estimated in Sect. 3.7 provides the following (Q MVAlinregbh):


Variable    $\hat\beta_j$    SE($\hat\beta_j$)    t          p-value
Constant     4.1769   0.3790    11.020    0.0000
X1          -0.0146   0.0117    -1.254    0.2105
X2           0.0014   0.0056     0.247    0.8051
X3          -0.0127   0.0223    -0.570    0.5692
X4           0.1100   0.0366     3.002    0.0028
X5          -0.2831   0.1053    -2.688    0.0074
X6           0.4211   0.1102     3.822    0.0001
X7           0.0064   0.0049     1.317    0.1885
X8          -0.1832   0.0368    -4.977    0.0000
X9           0.0684   0.0225     3.042    0.0025
X10         -0.2018   0.0484    -4.167    0.0000
X11         -0.0400   0.0081    -4.946    0.0000
X12          0.0445   0.0115     3.882    0.0001
X13         -0.2626   0.0161   -16.320    0.0000

Recall that the estimated residuals $y - \mathcal{X}\hat\beta$ did not show a big departure
from normality, which means that the testing procedure developed above can be
used.

1. First a global test of significance for the regression coefficients is performed,

   $H_0: (\beta_1, \ldots, \beta_{13}) = 0$.

   This is obtained by defining $A = (0_{13}, \mathcal{I}_{13})$ and $a = 0_{13}$ so that $H_0$ is equivalent
   to $A\beta = a$ where $\beta = (\beta_0, \beta_1, \ldots, \beta_{13})^{\top}$. The observed value of the test
   statistic is $F = 123.20$. This is highly significant ($F_{0.95;13,492} = 1.7401$), thus we
   reject $H_0$. Note that under $H_0$, $\hat\beta_{H_0} = (3.0345, 0, \ldots, 0)$ where $3.0345 = \bar{y}$.

2. Since we are interested in the effect that being located close to the river has on
   the value of the houses, the second test is $H_0: \beta_4 = 0$. This is done by fixing

   $A = (0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)$

   and $a = 0$ to obtain the equivalent hypothesis $H_0: A\beta = a$. The result is again
   significant: $F = 9.0125$ ($F_{0.95;1,492} = 3.8604$) with a $p$-value of 0.0028. Note
   that this is the same $p$-value obtained in the individual test $\beta_4 = 0$ in Chap. 3,
   computed using a different setup.

3. A third test notices the fact that some of the regressors in the full model (3.57)
   appear to be insignificant (that is, they have high individual $p$-values). It can
   be confirmed from a joint test if the corresponding reduced model, formulated
   by deleting the insignificant variables, is rejected by the data. We want to test
   $H_0: \beta_1 = \beta_2 = \beta_3 = \beta_7 = 0$. Hence,

   $$A = \begin{pmatrix}
   0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
   0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
   0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
   0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0
   \end{pmatrix}$$

   and $a = 0_4$ (rows three and four also involve $\beta_1$, which the first row already
   sets to zero, so the four rows together are equivalent to $H_0$). The test statistic
   is 0.9344, which is not significant for $F_{4,492}$. Given that the $p$-value is equal
   to 0.44, we cannot reject the null hypothesis nor the corresponding reduced
   model. The value of $\hat\beta$ under the null hypothesis is
   $\hat\beta_{H_0} = (4.16, 0, 0, 0, 0.11, -0.31, 0.47, 0, -0.19, 0.05, -0.20, -0.04, 0.05, -0.26)^{\top}$.
   A possible reduced model is

   $X_{14} = \beta_0 + \beta_4 X_4 + \beta_5 X_5 + \beta_6 X_6 + \beta_8 X_8 + \cdots + \beta_{13} X_{13} + \varepsilon$.

   Estimating this reduced model using OLS, as was done in Chap. 3, provides the
   results shown in Table 7.1.

   Table 7.1 Linear regression for Boston housing data set Q MVAlinreg2bh

   Variable    $\hat\beta_j$    SE        t          p-value
   Const        4.1582   0.3628    11.462    0.0000
   X4           0.1087   0.0362     2.999    0.0028
   X5          -0.3055   0.0973    -3.140    0.0018
   X6           0.4668   0.1059     4.407    0.0000
   X8          -0.1855   0.0327    -5.679    0.0000
   X9           0.0492   0.0183     2.690    0.0074
   X10         -0.2096   0.0446    -4.705    0.0000
   X11         -0.0410   0.0078    -5.280    0.0000
   X12          0.0481   0.0112     4.306    0.0000
   X13         -0.2588   0.0149   -17.396    0.0000

   Note that the reduced model has $r^2 = 0.763$ which is very close to $r^2 = 0.765$
   obtained from the full model. Clearly, including variables $X_1$, $X_2$, $X_3$, and $X_7$
   does not provide valuable information in explaining the variation of $X_{14}$, the
   price of the houses.
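For a single restriction such as $H_0: \beta_4 = 0$ in the second test, the $F$-statistic is the square of the individual $t$-statistic, which is why the two setups return the same $p$-value. The short Python check below uses the rounded values printed in this section; the sample size 506 is the sum of the group sizes given earlier, and the variable names are illustrative.

```python
# Degrees of freedom of the full Boston Housing regression.
n = 506                  # number of districts (256 + 250)
p = 14                   # regression coefficients, intercept included
df_residual = n - p      # denominator degrees of freedom of the F-tests

# Rounded values reported in the text for the river variable X4.
t_x4 = 3.002             # individual t-statistic in the full model
f_x4 = 9.0125            # F-statistic of the restriction beta_4 = 0

# With one restriction, F equals t squared (up to rounding of the inputs).
t_squared = t_x4 ** 2
print(df_residual, round(t_squared, 4), f_x4)
```

The small gap between the two numbers comes only from the three-decimal rounding of the $t$-statistic.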
7.4 Exercises


Exercise 7.1 Use Theorem 7.1 to derive a test for testing the hypothesis that a die
is balanced, based on $n$ tosses of that die. (Hint: use the multinomial probability
function.)

Exercise 7.2 Consider $N_3(\mu, \Sigma)$. Formulate the hypothesis $H_0: \mu_1 = \mu_2 = \mu_3$ in
terms of $A\mu = a$.

Exercise  7.3  Simulate  a  normal  sample  with  pt  —  (*)  and  £  =  ^()'5  °25  ^  and  test 

$H_0: 2\mu_1 - \mu_2 = 0.2$ first with $\Sigma$ known and then with $\Sigma$ unknown. Compare the
results.


Exercise  7.4  Derive  expression  (7.3)  for  the  LRT  statistic  in  Test  Problem  2. 

Exercise  7.5  With  the  simulated  data  set  of  Example  7.14,  test  the  hypothesis  of 
equality  of  the  covariance  matrices. 

Exercise 7.6 In the US companies data set, test the equality of means between the
energy and manufacturing sectors, taking the full vector of observations $X_1$ to $X_6$.
Derive the simultaneous confidence intervals for the differences.

/  2  —1  \ 

Exercise  7.7  Let  X  ~  A^(m»  £)  where  £  is  known  to  be  £  =  I  J.  We 

have  an  i.i.d.  sample  of  size  n  —  6  providing  x  T  =  (i  M  Solve  the  following  test 
problems  (a  —  0.05): 


(a)  Hq  : 

li  =  (2,  |)T 

^  : 

M  #  (2,  |)T 

(b)  Ho  : 

7 

Mi  M2  —  2 

Hi  : 

Mi  +  M2  ^ 

(c)  Ho  : 

Mi  —  M2  =  \ 

Hi  : 

Mi  -  M2  #  ^ 

(d)  Ho  '■ 

Mi  =  2  H\  : 

Mi  7^  2 

For  each  case,  represent  the  rejection  region  graphically  (comment). 


Exercise 7.8 Repeat the preceding exercise with $\Sigma$ unknown and
$S = \begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}$. Compare the results.


Exercise 7.9 Consider $X \sim N_3(\mu, \Sigma)$. An i.i.d. sample of size $n = 10$ provides:

$\bar{x} = (1, 0, 2)^{\top}$

(a) Knowing that the eigenvalues of $S$ are integers, describe a 95 % confidence
region for $\mu$. (Hint: to compute eigenvalues use $|S| = \prod_{j=1}^{3} \lambda_j$ and
$\operatorname{tr}(S) = \sum_{j=1}^{3} \lambda_j$.)

(b) Calculate the simultaneous confidence intervals for $\mu_1$, $\mu_2$ and $\mu_3$.

(c) Can we assert that $\mu_1$ is an average of $\mu_2$ and $\mu_3$?

Exercise 7.10 Consider two independent i.i.d. samples, each of size 10, from two
bivariate normal populations. The results are summarised below:

$\bar{x}_1 = (3, 1)^{\top}$; $\bar{x}_2 = (1, 1)^{\top}$

$S_1 = \begin{pmatrix} 4 & -1 \\ -1 & 2 \end{pmatrix}$

Provide a solution to the following tests:

(a) $H_0: \mu_1 = \mu_2$, $H_1: \mu_1 \neq \mu_2$

(b) $H_0: \mu_{11} = \mu_{21}$, $H_1: \mu_{11} \neq \mu_{21}$

(c) $H_0: \mu_{12} = \mu_{22}$, $H_1: \mu_{12} \neq \mu_{22}$

Compare the solutions and comment.

Exercise 7.11 Prove expression (7.4) in the Test Problem 2 with log-likelihoods
$\ell_0^*$ and $\ell_1^*$. [Hint: use (2.29).]

Exercise 7.12 Assume that $X \sim N_p(\mu, \Sigma)$ where $\Sigma$ is unknown.

(a) Derive the log LRT for testing the independence of the $p$ components, that is
$H_0$: $\Sigma$ is a diagonal matrix. (Solution: $-2\log\lambda = -n\log|\mathcal{R}|$ where $\mathcal{R}$ is the
correlation matrix, which is asymptotically a $\chi^2_{\frac{1}{2}p(p-1)}$ under $H_0$.)

(b) Assume that $\Sigma$ is a diagonal matrix (all the variables are independent). Can an
asymptotic test for $H_0: \mu = \mu_0$ against $H_1: \mu \neq \mu_0$ be derived? How would
this compare to $p$ independent univariate $t$-tests on each $\mu_j$?

(c) Show an easy derivation of an asymptotic test for testing the equality of the $p$
means [Hint: use $(C\bar{x})^{\top}(CSC^{\top})^{-1}C\bar{x} \to \chi^2_{p-1}$ where $S = \operatorname{diag}(s_{11}, \ldots, s_{pp})$
and $C$ is defined as in (7.10)]. Compare this to the simple ANOVA procedure used
in Sect. 3.5.

Exercise 7.13 The yields of wheat have been measured in 30 parcels that have been
randomly attributed to three lots, each prepared with one of three different fertilisers
A, B and C. The data are

           Fertiliser
Yield      A     B     C
1          4     6     2
2          3     7     1
3          2     7     1
4          5     5     1
5          4     5     3
6          4     5     4
7          3     8     3
8          3     9     3
9          3     9     2
10         1     6     2

Using Exercise 7.12,

(a) test the independence between the three variables.

(b) test whether $\mu^{\top} = [2\ 6\ 4]$ and compare this to the three univariate $t$-tests.

(c) test whether $\mu_1 = \mu_2 = \mu_3$ using simple ANOVA and the $\chi^2$ approximation.

Exercise 7.14 Consider an i.i.d. sample of size $n = 5$ from a bivariate normal
distribution


x~jvT(4))' 

where $\rho$ is a known parameter. Suppose $\bar{x}^{\top} = (1\ 0)$. For what value of $\rho$ would the
hypothesis $H_0: \mu^{\top} = (0\ 0)$ be rejected in favour of $H_1: \mu^{\top} \neq (0\ 0)$ (at the 5 %
level)?

Exercise 7.15 Using Example 7.14, test the last two cases described there and test
the sample number one ($n_1 = 30$), to see if they are from a normal population with
$\Sigma = 4\mathcal{I}_4$ (the sample covariance matrix to be used is given by $S_1$).

Exercise 7.16 Consider the bank data set. For the counterfeit bank notes, we want
to know if the length of the diagonal ($X_6$) can be predicted by a linear model in $X_1$
to $X_5$. Estimate the linear model and test if the coefficients are significantly different
from zero.

Exercise 7.17 In Example 7.10, can you predict the vocabulary score of the
children in eleventh grade, by knowing the results from grades 8-9 and 10? Estimate
a linear model and test its significance.

Exercise 7.18 Test the equality of the covariance matrices from the two groups in
the WAIS subtest (Example 7.19).

Exercise 7.19 Prove expressions (7.21)-(7.23).
Exercise 7.20 Using Theorem 6.3 and expression (7.16), construct an asymptotic
rejection region of size $\alpha$ for testing, in a general model $f(x, \theta)$, with $\theta \in \mathbb{R}^k$,

$H_0: \theta = \theta_0$ against $H_1: \theta \neq \theta_0$.

Exercise  7.21  Exercise  6.5  considered  the  pdf  f  (xi ,  X2 )  =  e2ei  e  '  01X2  0102  2 

1  2  ^2 

$x_1, x_2 > 0$. Solve the problem of testing $H_0: \theta^{\top} = (\theta_{01}, \theta_{02})$ from an iid sample of
size $n$ on $x = (x_1, x_2)^{\top}$, where $n$ is large.

Exercise 7.22 In Olkin and Veath (1980), the evolution of citrate concentrations
in plasma is observed at three different times of day, $X_1$ (8 am), $X_2$ (11 am) and
$X_3$ (3 pm), for two groups of patients who follow different diets. (The patients were
randomly attributed to each group under a balanced design $n_1 = n_2 = 5$.)

The data are:

Group    X1 (8 am)    X2 (11 am)    X3 (3 pm)
I        125          137           121
         144          173           147
         105          119           125
         151          149           128
         137          139           109
II        93          121           107
         116          135           106
         109           83           100
          89           95            83
         116          128           100

Test if the profiles of the groups are parallel, if they are at the same level and if
they are horizontal.
Multivariate Techniques


Chapter  8 

Regression  Models 


The aim of regression models is to model the variation of a quantitative response
variable $y$ in terms of the variation of one or several explanatory variables
$(x_1, \ldots, x_p)^{\top}$. We have already introduced such models in Chaps. 3 and 7 where
linear models were written in (3.50) as

$y = \mathcal{X}\beta + \varepsilon$,

where $y$ $(n \times 1)$ is the vector of observations for the response variable, $\mathcal{X}$ $(n \times p)$ is
the data matrix of the $p$ explanatory variables and $\varepsilon$ are the errors. Linear models
are not restricted to handling only linear relationships between $y$ and $x$. Curvature is
allowed by including appropriate higher order terms in the design matrix $\mathcal{X}$.

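Example 8.1 Let $y$ represent the response and $x_1, x_2$ two factors that explain the
variation of $y$ via the quadratic response model:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1}^2 + \beta_4 x_{i2}^2 + \beta_5 x_{i1}x_{i2} + \varepsilon_i, \quad i = 1, \ldots, n. \qquad (8.1)$$

This model (8.1) belongs to the class of linear models because it is linear in $\beta$. The
data matrix $\mathcal{X}$ is:

$$\mathcal{X} = \begin{pmatrix}
1 & x_{11} & x_{12} & x_{11}^2 & x_{12}^2 & x_{11}x_{12} \\
1 & x_{21} & x_{22} & x_{21}^2 & x_{22}^2 & x_{21}x_{22} \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
1 & x_{n1} & x_{n2} & x_{n1}^2 & x_{n2}^2 & x_{n1}x_{n2}
\end{pmatrix}$$

For a given value of $\beta$, the response surface can be represented in a three-dimensional
plot as in Fig. 8.1 where we display $y = 20 + 1x_1 + 2x_2 - 8x_1^2 - 6x_2^2 + 6x_1x_2$,
i.e. $\beta = (20, 1, 2, -8, -6, +6)^{\top}$.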
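To make the construction of the design matrix concrete, the following Python sketch builds the rows of $\mathcal{X}$ for model (8.1) and evaluates the response surface for the coefficient vector used in Fig. 8.1. The evaluation points and the function names are illustrative assumptions; only the coefficients come from the text.

```python
def design_row(x1, x2):
    """One row of the design matrix for the quadratic response model (8.1)."""
    return [1.0, x1, x2, x1 ** 2, x2 ** 2, x1 * x2]


def response(beta, x1, x2):
    """Noise-free response x'beta at the point (x1, x2)."""
    return sum(b * z for b, z in zip(beta, design_row(x1, x2)))


# Coefficients of the surface displayed in Fig. 8.1.
beta = [20, 1, 2, -8, -6, 6]

# A few hypothetical evaluation points (x1, x2).
points = [(-1.0, -1.0), (0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]

# The design matrix has one row per point and six columns.
X = [design_row(x1, x2) for x1, x2 in points]
surface = [response(beta, x1, x2) for x1, x2 in points]

for (x1, x2), value in zip(points, surface):
    print(f"y({x1:+.1f}, {x2:+.1f}) = {value:.2f}")
```

The model is non-linear in $x_1$ and $x_2$ but each response is still a plain inner product of a design row with $\beta$, which is what makes it a linear model.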
Fig. 8.1 A 3-D response surface Q MVAresponsesurface


Note also that pure non-linear models can sometimes be rewritten as a linear
model by choosing an appropriate transformation of the coordinates of the variables.
For instance the Cobb-Douglas production function

$$y_i = k\, x_{i1}^{\beta_1} x_{i2}^{\beta_2} x_{i3}^{\beta_3},$$

where $y$ is the level of the production of a plant and $(x_1, x_2, x_3)$ are three factors of
production (e.g. labour, capital and energy), can be transformed into a linear model
in the log scale. We have indeed

$$\log y_i = \beta_0 + \beta_1 \log x_{i1} + \beta_2 \log x_{i2} + \beta_3 \log x_{i3},$$

where $\beta_0 = \log k$ and the $\beta_j$, $j = 1, \ldots, 3$ are the elasticities
($\beta_j = \partial \log y / \partial \log x_j$).
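The log transformation can be checked numerically. The Python sketch below evaluates a Cobb-Douglas function for hypothetical values of $k$, $\beta$ and the inputs (none of these numbers come from the text), compares $\log y$ with the linear expression in the logs, and recovers the first elasticity by a finite difference in $\log x_1$.

```python
import math


def cobb_douglas(k, betas, x):
    """Production level y = k * x1**b1 * x2**b2 * x3**b3."""
    y = k
    for b, xi in zip(betas, x):
        y *= xi ** b
    return y


# Hypothetical parameter values and input levels, for illustration only.
k = 2.0
betas = [0.5, 0.3, 0.2]
x = [4.0, 9.0, 16.0]

y = cobb_douglas(k, betas, x)

# Linear form in the log scale: beta_0 = log k plus the weighted log inputs.
log_linear = math.log(k) + sum(b * math.log(xi) for b, xi in zip(betas, x))

# Elasticity of y with respect to x1: change in log y per unit change in log x1.
h = 1e-3
x_shifted = [x[0] * math.exp(h), x[1], x[2]]
elasticity_1 = (math.log(cobb_douglas(k, betas, x_shifted)) - math.log(y)) / h

print(math.log(y), log_linear, elasticity_1)
```

Because the model is exactly linear in the logs, the finite difference reproduces $\beta_1$ up to floating-point error rather than only approximately.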

Linear models are flexible and cover a wide class of models. If $\mathcal{X}$ has full
rank, they can easily be estimated by least squares $\hat\beta = (\mathcal{X}^{\top}\mathcal{X})^{-1}\mathcal{X}^{\top}y$ and linear
restrictions on the $\beta$’s can be tested using the tools developed in Chap. 7.

In  Chap.  3,  we  saw  that  even  qualitative  explanatory  variables  can  be  used  by 
defining  appropriate  coding  of  the  nominal  values  of  x.  In  this  chapter,  we  will 
extend  our  toolbox  by  showing  how  to  code  these  qualitative  factors  in  a  way 
which  allows  the  introduction  of  several  qualitative  factors  including  the  possibility 
of  interactions.  This  covers  more  general  ANOVA  models  than  those  introduced 
in  Chap.  3.  This  includes  the  ANCOVA  models  where  qualitative  and  quantitative 
variables  are  both  present  in  the  explanatory  variables. 
When the response variable is qualitative or categorical (for instance, an individual
can be employed or unemployed, a company may be bankrupt or not, the
opinion  of  one  person  relative  to  a  particular  issue  can  be  “in  favour”,  “against”  or 
“indifferent  to”,  etc.),  linear  models  have  to  be  adapted  to  this  particular  situation. 
The  most  useful  models  for  these  cases  will  be  presented  in  the  second  part  of  the 
chapter;  this  covers  the  log-linear  models  for  contingency  tables  (where  we  analyse 
the  relations  between  several  categorical  variables)  and  the  logit  model  for  quantal 
or  binomial  responses  where  we  analyse  the  probability  of  being  in  one  state  as  a 
function  of  explanatory  variables. 


8.1  General  ANOVA  and  ANCOVA  Models 

8.1.1  ANOVA  Models 

One-Factor  Models 

In Sect. 3.5, we introduced the example of analysing the effect of one factor (three
possible marketing strategies) on the sales of a product (a pullover), see Table 3.2.
The standard way to present one-factor ANOVA models with $p$ levels is as follows

$$y_{kl} = \mu + \alpha_l + \varepsilon_{kl}, \quad k = 1, \ldots, n_l, \text{ and } l = 1, \ldots, p, \qquad (8.2)$$

all the $\varepsilon_{kl}$ being independent. Here $l$ is the label which indicates the level of the
factor and $\alpha_l$ is the effect of the $l$th level: it measures the deviation from $\mu$, the
global mean of $y$, due to this level of the factor. In this notation, we need to
impose the restriction $\sum_{l=1}^{p} \alpha_l = 0$ in order to identify $\mu$ as the mean of $y$. This
presentation is equivalent to, but slightly different from, the one presented in Chap. 3
(compare with Eq. (3.41)), and it allows for easier extension to the multiple factors
case. Note also that here we allow different sample sizes for each level of the factor
(an unbalanced design, more general than the balanced design presented in Chap. 3).

To simplify the presentation, assume as in the pullover example that $p = 3$. In
this case, one could be tempted to write the model (8.2) under the general form of a
linear model by using three indicator variables

$$y_i = \mu + \alpha_1 x_{i1} + \alpha_2 x_{i2} + \alpha_3 x_{i3} + \varepsilon_i,$$

where $x_{il}$ is equal to 1 or 0 according to whether the $i$th observation belongs (or not) to
the level $l$ of the factor. In matrix notation and letting, for simplicity, $n_1 = n_2 =
n_3 = 2$ we have with $\beta = (\mu, \alpha_1, \alpha_2, \alpha_3)^{\top}$

$$y = \mathcal{X}\beta + \varepsilon, \qquad (8.3)$$

where the design matrix $\mathcal{X}$ is given by:

$$\mathcal{X} = \begin{pmatrix}
1 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 \\
1 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 \\
1 & 0 & 0 & 1
\end{pmatrix}$$

Unfortunately, this type of coding is not useful because the matrix $\mathcal{X}$ is not of full
rank (the sum of each row is equal to the same constant 2, that is, the first column
equals the sum of the other three) and therefore the matrix
$\mathcal{X}^{\top}\mathcal{X}$ is not invertible. One way to overcome this problem is to change the coding
by introducing the additional constraint that the effects add up to zero. There are
many ways to achieve this. Noting that $\alpha_3 = -\alpha_1 - \alpha_2$, we do not need to introduce
$\alpha_3$ explicitly in the model. The linear model could indeed be written as

$$y_i = \mu + \alpha_1 x_{i1} + \alpha_2 x_{i2} + \varepsilon_i,$$

with a design matrix defined as

$$\mathcal{X} = \begin{pmatrix}
1 & 1 & 0 \\
1 & 1 & 0 \\
1 & 0 & 1 \\
1 & 0 & 1 \\
1 & -1 & -1 \\
1 & -1 & -1
\end{pmatrix}$$

which automatically implies that $\alpha_3 = -(\alpha_1 + \alpha_2)$. The linear model (8.3) is now
correct with $\beta = (\mu, \alpha_1, \alpha_2)^{\top}$. The least squares estimator $\hat\beta = (\mathcal{X}^{\top}\mathcal{X})^{-1}\mathcal{X}^{\top}y$
can be computed providing the estimator of the ANOVA parameters $\mu$ and $\alpha_l$, $l =
1, \ldots, 3$. Any linear constraint on $\beta$ can be tested by using the techniques described
in Chap. 7. For instance, the null hypothesis of no factor effect $H_0: \alpha_1 = \alpha_2 =
\alpha_3 = 0$ can be written as $H_0: A\beta = a$, where
$A = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$ and $a = (0, 0)^{\top}$.
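The two codings of the one-factor model can be compared side by side. The Python sketch below builds both design matrices for $n_1 = n_2 = n_3 = 2$, checks the linear dependence that makes the indicator coding rank deficient, and shows how the effect-coded parameters are read from group means. The response values are hypothetical and the function names are illustrative; in this one-factor model the least squares fit coincides with the group means, which is what the last lines use.

```python
def indicator_row(level, p):
    """Intercept plus one 0/1 indicator per level (the rank-deficient coding)."""
    return [1] + [1 if level == l else 0 for l in range(1, p + 1)]


def effect_row(level, p):
    """Intercept plus p - 1 effect-coded columns; the last level is coded -1."""
    if level == p:
        return [1] + [-1] * (p - 1)
    return [1] + [1 if level == l else 0 for l in range(1, p)]


levels = [1, 1, 2, 2, 3, 3]                      # n_1 = n_2 = n_3 = 2
X_indicator = [indicator_row(l, 3) for l in levels]
X_effect = [effect_row(l, 3) for l in levels]

# Rank deficiency: the intercept column equals the sum of the indicator columns.
intercept_is_sum = all(row[0] == sum(row[1:]) for row in X_indicator)

# Hypothetical responses, used only to show how the parameters are read.
y = [9.0, 11.0, 13.0, 15.0, 5.0, 7.0]
group_means = [
    sum(v for v, l in zip(y, levels) if l == k) / levels.count(k) for k in (1, 2, 3)
]
mu = sum(group_means) / 3                        # global level
alpha = [m - mu for m in group_means]            # effects, summing to zero
beta = [mu, alpha[0], alpha[1]]                  # alpha_3 is implied

# The effect-coded design reproduces the group means, including level 3.
fitted = [sum(b * x for b, x in zip(beta, row)) for row in X_effect]
print(intercept_is_sum, beta, fitted)
```

The rows coded $(1, -1, -1)$ return $\mu - \alpha_1 - \alpha_2$, which is the mean of the third group without a separate parameter for it.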


Multiple-Factors  Models 

The coding above can be extended to more general situations with many qualitative
variables (factors) and with the possibility of interactions between the factors.
Suppose that in a marketing example, the sales of a product can be explained by
two factors: the marketing strategy with three levels (as in the pullover example) but
also the location of the shop that may be either in a big shopping centre or in a less
commercial location (two levels for this factor). We might also think that there is an
interaction between the two factors: the marketing strategy might have a different
effect in a shopping centre than in a small quiet area. To fix the idea the data are
collected as in Table 8.1.

Table 8.1 A two factor ANOVA data set, factor A, three levels of the marketing
strategy and factor B, two levels for the location

        B1            B2
A1      18, 15        15, 20, 25, 30
A2      5, 8, 8       10, 12
A3      10, 14        20, 25

The figures represent the resulting sales during the same period

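The general two factor model with interactions can be written as

$$y_{ijk} = \mu + \alpha_i + \gamma_j + (\alpha\gamma)_{ij} + \varepsilon_{ijk}, \quad i = 1, \ldots, r,\ j = 1, \ldots, s,\ k = 1, \ldots, n_{ij} \qquad (8.4)$$

where the identification constraints are:

$$\sum_{i=1}^{r} \alpha_i = 0 \quad\text{and}\quad \sum_{j=1}^{s} \gamma_j = 0,$$

$$\sum_{i=1}^{r} (\alpha\gamma)_{ij} = 0, \quad j = 1, \ldots, s,$$

$$\sum_{j=1}^{s} (\alpha\gamma)_{ij} = 0, \quad i = 1, \ldots, r. \qquad (8.5)$$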


In our example of Table 8.1 we have $r = 3$ and $s = 2$. The $\alpha$’s measure the
effect of the marketing strategy (three levels) and the $\gamma$’s the effect of the location
(two levels). A positive (negative) value of one of these parameters would indicate
a favourable (unfavourable) effect on the expected sales; the global average of
sales being represented by the parameter $\mu$. The interactions are measured by the
parameters $(\alpha\gamma)_{ij}$, $i = 1, \ldots, r$, $j = 1, \ldots, s$, where identification again
requires the $(r + s)$ constraints in (8.5) on the interaction terms.

For example, a positive value of $(\alpha\gamma)_{11}$ would indicate that the effect of the sales
strategy $A_1$ (advertisement in local newspaper), if any, is more favourable on the
sales in the location $B_1$ (in a big commercial centre) than in the location $B_2$ (not
a commercial centre) with the relation $(\alpha\gamma)_{11} = -(\alpha\gamma)_{12}$. As another example,
a negative value of $(\alpha\gamma)_{31}$ would indicate that the marketing strategy $A_3$ (luxury
presentation in shop windows) has less effect, if any, in location type $B_1$ than in $B_2$:
again $(\alpha\gamma)_{31} = -(\alpha\gamma)_{32}$, etc.

The nice thing is that it is easy to extend the coding rule for the one-factor model
to this general situation, in order to present the model as a standard linear model with
the appropriate design matrix $\mathcal{X}$. To build the columns of $\mathcal{X}$ for the effect of each
factor, we will need, as above, $r - 1$ (and $s - 1$) variables for coding a qualitative
variable with $r$ (and $s$, respectively) levels with the convention defined above in
the one-factor case. For the interactions between an $r$-level factor and
an $s$-level factor, we will need $(r - 1) \times (s - 1)$ additional columns that will be
obtained by performing the product, element by element, of the corresponding main
effect columns. So, at the end, for a full model with all the interactions, we have
$\{1 + (r - 1) + (s - 1) + (r - 1)(s - 1)\} = rs$ parameters where the first column of
1’s is for the intercept (the constant $\mu$). We illustrate this for our marketing example
where $r = 3$ and $s = 2$. We first describe a model without interactions.

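1. Model without interactions

Without the interactions (all the $(\alpha\gamma)_{ij} = 0$) the model could be written with
$3 = (r - 1) + (s - 1)$ coded variables in a simple linear model form as in (8.3),
with the matrices:

$$y = \begin{pmatrix} 18 \\ 15 \\ 15 \\ 20 \\ 25 \\ 30 \\ 5 \\ 8 \\ 8 \\ 10 \\ 12 \\ 10 \\ 14 \\ 20 \\ 25 \end{pmatrix},
\qquad
\mathcal{X} = \begin{pmatrix}
1 & 1 & 0 & 1 \\
1 & 1 & 0 & 1 \\
1 & 1 & 0 & -1 \\
1 & 1 & 0 & -1 \\
1 & 1 & 0 & -1 \\
1 & 1 & 0 & -1 \\
1 & 0 & 1 & 1 \\
1 & 0 & 1 & 1 \\
1 & 0 & 1 & 1 \\
1 & 0 & 1 & -1 \\
1 & 0 & 1 & -1 \\
1 & -1 & -1 & 1 \\
1 & -1 & -1 & 1 \\
1 & -1 & -1 & -1 \\
1 & -1 & -1 & -1
\end{pmatrix}$$

and $\beta = (\mu, \alpha_1, \alpha_2, \gamma_1)^{\top}$. Then, $\alpha_3 = -(\alpha_1 + \alpha_2)$ and $\gamma_2 = -\gamma_1$.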

2. Model with interactions

A model with interaction between A and B is obtained by adding new columns
to the design matrix. We need $2 = (r - 1) \times (s - 1)$ new coding variables which
are defined as the product, element-by-element, of the corresponding columns
obtained for the main effects. For instance for the interaction parameter $(\alpha\gamma)_{11}$,
we multiply the column used for coding $\alpha_1$ by the column defined for coding $\gamma_1$,
where the product is element-by-element. The same is done for the parameter
$(\alpha\gamma)_{21}$. No other columns are necessary, since the remaining interactions are
derived from the identification constraints (8.5). We obtain

$$\mathcal{X} = \begin{pmatrix}
1 & 1 & 0 & 1 & 1 & 0 \\
1 & 1 & 0 & 1 & 1 & 0 \\
1 & 1 & 0 & -1 & -1 & 0 \\
1 & 1 & 0 & -1 & -1 & 0 \\
1 & 1 & 0 & -1 & -1 & 0 \\
1 & 1 & 0 & -1 & -1 & 0 \\
1 & 0 & 1 & 1 & 0 & 1 \\
1 & 0 & 1 & 1 & 0 & 1 \\
1 & 0 & 1 & 1 & 0 & 1 \\
1 & 0 & 1 & -1 & 0 & -1 \\
1 & 0 & 1 & -1 & 0 & -1 \\
1 & -1 & -1 & 1 & -1 & -1 \\
1 & -1 & -1 & 1 & -1 & -1 \\
1 & -1 & -1 & -1 & 1 & 1 \\
1 & -1 & -1 & -1 & 1 & 1
\end{pmatrix}$$

with $\beta = (\mu, \alpha_1, \alpha_2, \gamma_1, (\alpha\gamma)_{11}, (\alpha\gamma)_{21})^{\top}$. The other interactions can indeed be
derived from (8.5)

$(\alpha\gamma)_{12} = -(\alpha\gamma)_{11}$

$(\alpha\gamma)_{22} = -(\alpha\gamma)_{21}$

$(\alpha\gamma)_{31} = -((\alpha\gamma)_{11} + (\alpha\gamma)_{21})$

$(\alpha\gamma)_{32} = -(\alpha\gamma)_{31}$.

The estimation of $\beta$ is again simply given by the least squares solution $\hat\beta =
(\mathcal{X}^{\top}\mathcal{X})^{-1}\mathcal{X}^{\top}y$.
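The two design matrices above can be generated from the factor levels and fitted by least squares. The following Python sketch does this for the data of Table 8.1 using only the standard library; the small Gaussian-elimination solver and all function names are illustrative helpers rather than the book's quantlets, and a numerical library would normally replace the solver.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def least_squares(X, y):
    """Return (beta, RSS) from the normal equations X'X beta = X'y."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    beta = solve(xtx, xty)
    rss = sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2 for row, yi in zip(X, y))
    return beta, rss


# Effect coding: factor A (3 levels) needs 2 columns, factor B (2 levels) needs 1.
A_CODE = {1: (1, 0), 2: (0, 1), 3: (-1, -1)}
B_CODE = {1: 1, 2: -1}

# Sales from Table 8.1, indexed by (level of A, level of B).
cells = {
    (1, 1): [18, 15], (1, 2): [15, 20, 25, 30],
    (2, 1): [5, 8, 8], (2, 2): [10, 12],
    (3, 1): [10, 14], (3, 2): [20, 25],
}

X_additive, X_full, y = [], [], []
for (a, b), sales in cells.items():
    a1, a2 = A_CODE[a]
    g1 = B_CODE[b]
    for value in sales:
        X_additive.append([1, a1, a2, g1])
        # Interaction columns are element-wise products of main-effect columns.
        X_full.append([1, a1, a2, g1, a1 * g1, a2 * g1])
        y.append(value)

beta_additive, rss_additive = least_squares(X_additive, y)
beta_full, rss_full = least_squares(X_full, y)
print([round(b, 4) for b in beta_additive], round(rss_additive, 4))
print([round(b, 4) for b in beta_full], round(rss_full, 4))
```

Forming the normal equations explicitly keeps the sketch short; for ill-conditioned designs a QR-based solver would be the safer choice.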

Example 8.2 Let us come back to the marketing data provided by the two-way
Table 8.1. The values of $\hat\beta$ in the full model, with interactions, are given in Table 8.2.
The $p$-values in the right column are for the individual tests: it appears that the
interactions do not provide additional significant explanation of $y$, but the effect of
the two factors seems significant.

Using the techniques of Chap. 7, we can test some reduced model corresponding
to linear constraints on the $\beta$’s. The full model is the model with all the parameters,
including all the interactions. The overall fit test $H_0$: all the parameters, except $\mu$,
are equal to zero, gives the value $F_{\text{observed}} = 6.5772$ with a $p$-value of 0.0077 for
an $F_{5,9}$, so that $H_0$ is rejected. In this case, the $RSS_{\text{reduced}} = 735.3333$. So there is
some effect by the factors.



Table 8.2 Estimation of the two factors ANOVA model with data from Table 8.1

                         $\hat\beta$      p-values
$\mu$                    15.25
$\alpha_1$                4.25     0.0218
$\alpha_2$               -6.25     0.0033
$\gamma_1$               -3.42     0.0139
$(\alpha\gamma)_{11}$     0.42     0.7922
$(\alpha\gamma)_{21}$     1.42     0.8096
$RSS_{\text{full}}$     158.00

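We then test a less reduced model. We can test if the interaction terms are
significantly different from zero. This is a linear constraint on $\beta$ with

$$A = \begin{pmatrix} 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}, \qquad a = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.$$

Under the null we obtain:

$$\hat\beta_{H_0} = (15.3035,\ 4.0975,\ -6.0440,\ -3.2972,\ 0,\ 0)^{\top}$$

and $RSS_{\text{reduced}} = 181.8019$. The observed value $F = 0.6779$ is not
significant (residual degrees of freedom $r = 11$ for the reduced model and $f = 9$ for the
full model): the $p$-value $= P(F_{2,9} > 0.6779) = 0.5318$, confirming
the absence of interactions.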
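Both $F$-statistics of Example 8.2 follow directly from the residual sums of squares of the full and reduced models. The Python sketch below reproduces them from the values quoted in the text; the function name and argument layout are illustrative, and the residual degrees of freedom are the sample size $n = 15$ minus the number of fitted parameters.

```python
def f_statistic(rss_reduced, rss_full, df_reduced, df_full):
    """F-statistic comparing a reduced model with the full model.

    df_reduced and df_full are residual degrees of freedom; their
    difference is the number of restrictions being tested.
    """
    restrictions = df_reduced - df_full
    return ((rss_reduced - rss_full) / restrictions) / (rss_full / df_full)


n = 15                      # observations in Table 8.1
rss_full = 158.00           # full model: 6 parameters
df_full = n - 6

# Overall fit: the reduced model keeps only the intercept mu.
f_overall = f_statistic(735.3333, rss_full, n - 1, df_full)

# Interaction test: the reduced model keeps mu, alpha_1, alpha_2, gamma_1.
f_interaction = f_statistic(181.8019, rss_full, n - 4, df_full)

print(round(f_overall, 4), round(f_interaction, 4))
```

The same function covers any nested comparison, for instance the main-effect tests left to the reader, once the corresponding reduced residual sum of squares is available.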

Now  taking  the  model  without  the  interactions  as  the  full  model,  we  can  test 
if  one  of  the  main  effects  a  (marketing  strategy)  or  y  (location)  or  both  are 
significantly  different  from  zero.  We  leave  this  as  an  exercise  for  the  reader. 


8.1.2  ANCOVA  Models 

ANCOVA (ANalysis of COVAriance) models are mixed models where some variables are
qualitative and others are quantitative. The same coding as in the ANOVA will be used
for the qualitative variables. The design matrix $\mathcal{X}$ is completed by the columns for
the quantitative explanatory variables $x$. Interactions between a qualitative variable
(a factor with $r$ levels) and a quantitative one $x$ are also possible; this corresponds to
situations where the effect of $x$ on the response $y$ is different according to the level
of the factor. This is achieved by adding to the design matrix $\mathcal{X}$ new columns
obtained by the product, element-by-element, of the quantitative variable with the
coded variables for the factor ($r - 1$ interaction variables if the categorical variable
has $r$ levels).

For instance consider a simple model where a response $y$ is explained by one
explanatory variable $x$ and one factor with two levels (for instance the gender: level
1 for men and level 2 for women). In the case $n_1 = n_2 = 3$ we would have

$$\mathcal{X} = \begin{pmatrix}
1 & x_1 & 1 & x_1 \\
1 & x_2 & 1 & x_2 \\
1 & x_3 & 1 & x_3 \\
1 & x_4 & -1 & -x_4 \\
1 & x_5 & -1 & -x_5 \\
1 & x_6 & -1 & -x_6
\end{pmatrix}$$

with $\beta = (\beta_1, \beta_2, \beta_3, \beta_4)^{\top}$. The intercept and the slope are $(\beta_1 + \beta_3)$ and $(\beta_2 + \beta_4)$
for men and $(\beta_1 - \beta_3)$ and $(\beta_2 - \beta_4)$ for women. This situation is displayed in
Fig. 8.2.
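The way the four coefficients translate into two regression lines can be verified with a small Python sketch. The values of $x$ and $\beta$ below are hypothetical and the function names are illustrative; only the coding of the design matrix follows the text.

```python
def ancova_row(x, level):
    """Design row (1, x, g, g*x) with g = +1 for level 1 and g = -1 for level 2."""
    g = 1 if level == 1 else -1
    return [1, x, g, g * x]


def group_line(beta, level):
    """Intercept and slope of the regression line for one level of the factor."""
    b1, b2, b3, b4 = beta
    g = 1 if level == 1 else -1
    return b1 + g * b3, b2 + g * b4


# Hypothetical covariate values, three observations per level.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
levels = [1, 1, 1, 2, 2, 2]
X = [ancova_row(xi, li) for xi, li in zip(x, levels)]

# Hypothetical coefficient vector (beta_1, beta_2, beta_3, beta_4).
beta = [2.0, 0.5, 1.0, 0.25]

line_level_1 = group_line(beta, 1)   # (beta_1 + beta_3, beta_2 + beta_4)
line_level_2 = group_line(beta, 2)   # (beta_1 - beta_3, beta_2 - beta_4)

# The matrix product X beta gives the same values as the two separate lines.
fitted = [sum(b * z for b, z in zip(beta, row)) for row in X]
print(line_level_1, line_level_2, fitted)
```

Setting $\beta_4 = 0$ in this parameterisation gives two parallel lines, which is the model without interaction between the factor and the covariate.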

Example 8.3 Consider the Car Data provided in Sect. 22.3. We want to analyse the
effect of the weight ($W$) and the displacement ($D$) on the mileage ($M$). We would also
like to test if the origin of the car (the factor $C$) has some effect on the response
and if the effect of the continuous variables is different for the different levels of the
factor.

From the regression results in Table 8.3, we observe that only the weight affects
the mileage, while the displacement does not. When the origin of the car is added,
neither the displacement nor the factor is significant. Table 8.4 reports the
estimates for the different factor levels.


Fig. 8.2 A model with interaction

Table 8.3 Estimation of the effects of weight and displacement on the mileage Q MVAcareffect

        $\hat{\beta}$   p-values   $\hat{\beta}$   p-values
$\mu$   41.0066         0.0000     43.4031         0.0000
W       -0.0073         0.0000     -0.0074         0.0000
D        0.0118         0.2250      0.0081         0.4140
C                                  -0.9675         0.1250

Table 8.4 Different factor levels on the response Q MVAcareffect

         $\hat{\mu}$   p-values   W         p-values   D         p-values
c = 1    40.043        0.0000     -0.0065   0.0000      0.0058   0.3790
c = 2    47.557        0.0005      0.0081   0.3666     -0.3582   0.0160
c = 3    44.174        0.0002      0.0039   0.7556     -0.2650   0.3031

8.1.3  Boston  Housing 

In Chaps. 3 and 7, linear models were used to analyse whether the variations of the price (the variables were transformed in Sect. 1.9) could be explained by other variables. A reduced model was obtained in Sect. 7.3 with the results shown in Table 7.1, with $r^2 = 0.763$. The model was:

$$
X_{14} = \beta_0 + \beta_4 X_4 + \beta_5 X_5 + \beta_6 X_6 + \beta_8 X_8 + \beta_9 X_9 + \beta_{10} X_{10} + \beta_{11} X_{11} + \beta_{12} X_{12} + \beta_{13} X_{13}.
$$

One factor ($X_4$) was coded as a binary variable (1 if the house is close to the Charles River and 0 if it is not). Taking advantage of the ANCOVA models described above, we would like to add a new factor built from the original quantitative variable $X_9$ = index of accessibility to radial highways. So we will transform $X_4$ to be 1 if close to the Charles River and $-1$ if not, and we will replace $X_9$ by a new factor coded $X_{15} = 1$ if $X_9 \ge \mathrm{median}(X_9)$ and $X_{15} = -1$ if $X_9 < \mathrm{median}(X_9)$. We also want to consider the interaction of $X_4$ with $X_{12}$ (proportion of blacks) and the interaction of $X_4$ with the new factor $X_{15}$. The results are shown in Table 8.5.


Summary 

ANOVA models can be divided into one-factor models and
multiple factor models.

Multiple  factor  models  analyse  many  qualitative  variables  and  the 
interactions  between  them. 


Table 8.5 Estimation of the ANCOVA model using the Boston housing data Q MVAboshousing

                       $\hat{\beta}$   p-values   $\hat{\beta}$   p-values
$\beta_0$               32.27          0.00        27.65          0.00
$\beta_4$                1.54          0.00        -3.19          0.32
$\beta_5$              -17.59          0.00       -16.50          0.00
$\beta_6$                4.27          0.00         4.23          0.00
$\beta_8$               -1.13          0.00        -1.10          0.00
$\beta_{10}$             0.00          0.97         0.00          0.95
$\beta_{11}$            -0.97          0.00        -0.97          0.00
$\beta_{12}$             0.01          0.00         0.02          0.01
$\beta_{13}$            -0.54          0.00        -0.54          0.00
$\beta_{15}$             0.21          0.46         0.23          0.66
$\beta_{4*12}$                                      0.01          0.13
$\beta_{4*15}$                                      0.03          0.95

Summary  (continued) 

ANCOVA models are mixed models with qualitative and quantitative
variables, and can also incorporate the interaction between a
qualitative and a quantitative variable.


8.2  Categorical  Responses 

8.2.1  Multinomial  Sampling  and  Contingency  Tables 

In many applications, the response variable of interest is qualitative or categorical, in the sense that the response takes its nominal value in one of, say, K classes or categories. Often we observe counts $y_k$, the number of observations in category $k = 1, \dots, K$. If the total number of observations $n = \sum_{k=1}^{K} y_k$ is fixed and we may assume independence of the observations, we obtain a multinomial sampling process.

If we denote by $p_k$ the probability of observing the kth category, with $\sum_{k=1}^{K} p_k = 1$, we have $E(Y_k) = m_k = n p_k$. The likelihood of the sample can then be written as:

$$
L = \frac{n!}{\prod_{k=1}^{K} y_k!} \prod_{k=1}^{K} \left( \frac{m_k}{n} \right)^{y_k}. \tag{8.6}
$$
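As a hedged numerical illustration of (8.6), the sketch below evaluates the multinomial log-likelihood with $m_k = n p_k$. The counts are hypothetical, and the function name `multinomial_loglik` is an assumption made for this example only.

```python
import math


def multinomial_loglik(counts, probs):
    """Log of (8.6) with m_k = n * p_k:
    log n! - sum_k log y_k! + sum_k y_k log p_k."""
    n = sum(counts)
    # lgamma(y + 1) equals log(y!) and avoids overflow for large counts.
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(y + 1) for y in counts)
    return log_coef + sum(y * math.log(p) for y, p in zip(counts, probs) if y > 0)


# Hypothetical counts in K = 3 categories.
counts = [3, 5, 2]
n = sum(counts)

# The observed proportions y_k / n maximise the likelihood.
p_hat = [y / n for y in counts]
ll_hat = multinomial_loglik(counts, p_hat)
ll_uniform = multinomial_loglik(counts, [1 / 3, 1 / 3, 1 / 3])
```

Any other probability vector, such as the uniform one used here for comparison, gives a log-likelihood that is no larger than the value at the observed proportions.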
In contingency tables, the categories are defined by several qualitative variables. For example, in a $(J \times K)$ two-way table, the observations (counts) $y_{jk}$, $j = 1, \dots, J$ and $k = 1, \dots, K$, are reported for row j and column k. Here $n = \sum_{j=1}^{J} \sum_{k=1}^{K} y_{jk}$. Log-linear models introduce a linear structure on the logarithms of the expected frequencies $m_{jk} = E(y_{jk}) = n p_{jk}$, with $\sum_{j=1}^{J} \sum_{k=1}^{K} p_{jk} = 1$. Log-linear structures on $m_{jk}$ will impose the same structure for the $p_{jk}$; the estimation of the model will then be obtained by constrained maximum likelihood. Three-way tables $(J \times K \times L)$ may be analysed in the same way.

Sometimes additional information is available on explanatory variables x. In this case, the logit model will be appropriate when the categorical response is binary ($K = 2$). We will introduce these models when the main response of interest is binary (for instance tables $(2 \times K)$ or $(2 \times K \times L)$). Further, we will show how they can be adapted to the case of contingency tables. Contingency tables are also analysed by multivariate descriptive tools in Chap. 15.


8.2.2  Log-Linear  Models  for  Contingency  Tables 

Two-Way  Tables 

Consider a $(J \times K)$ two-way table, where $y_{jk}$ is the number of observations having the nominal value j for the first qualitative character and nominal value k for the second character. Since the total number of observations $n = \sum_{j=1}^{J} \sum_{k=1}^{K} y_{jk}$ is fixed, there are $JK - 1$ free cells in the table. The multinomial likelihood can be written as in (8.6):

$$
L = \frac{n!}{\prod_{j=1}^{J} \prod_{k=1}^{K} y_{jk}!} \prod_{j=1}^{J} \prod_{k=1}^{K} \left( \frac{m_{jk}}{n} \right)^{y_{jk}}, \tag{8.7}
$$

where we now introduce a log-linear structure to analyse the role of the rows and the columns in determining the parameters $m_{jk} = E(y_{jk})$ (or $p_{jk}$).

1. Model without interaction

Suppose that there is no interaction between the rows and the columns: this corresponds to the hypothesis of independence between the two qualitative characters. In other words, $p_{jk} = p_j p_k$ for all j, k. This implies the log-linear model:

$$
\log m_{jk} = \mu + \alpha_j + \gamma_k \quad \text{for } j = 1, \dots, J,\; k = 1, \dots, K, \tag{8.8}
$$

where, as in ANOVA models, for identification purposes $\sum_{j=1}^{J} \alpha_j = \sum_{k=1}^{K} \gamma_k = 0$. Using the same coding devices as above, the model can be written as

$$
\log m = \mathcal{X} \beta. \tag{8.9}
$$

For a $(2 \times 3)$ table we have:

$$
\log m = \begin{pmatrix} \log m_{11} \\ \log m_{12} \\ \log m_{13} \\ \log m_{21} \\ \log m_{22} \\ \log m_{23} \end{pmatrix}, \quad
\mathcal{X} = \begin{pmatrix}
1 & 1 & 1 & 0 \\
1 & 1 & 0 & 1 \\
1 & 1 & -1 & -1 \\
1 & -1 & 1 & 0 \\
1 & -1 & 0 & 1 \\
1 & -1 & -1 & -1
\end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix},
$$
where the first column of $\mathcal{X}$ is for the constant term, the second column is the coded column for the 2-level row effect and the last two columns are the coded columns for the 3-level column effect. The estimation is obtained by maximising the log-likelihood, which is equivalent to maximising the function $L(\beta)$ in $\beta$:

$$
L(\beta) = \sum_{j=1}^{J} \sum_{k=1}^{K} y_{jk} \log m_{jk}. \tag{8.10}
$$

The maximisation is under the constraint $\sum_{j,k} m_{jk} = n$. In summary we have $1 + (J - 1) + (K - 1) - 1$ free parameters for $JK - 1$ free cells. The number of degrees of freedom in the model is the number of free cells minus the number of free parameters. It is given by

$$
r = JK - 1 - (J - 1) - (K - 1) = (J - 1)(K - 1).
$$

In the example above, we therefore have $(3 - 1) \times (2 - 1) = 2$ degrees of freedom. The original parameters of the model can then be estimated as:

$$
\begin{aligned}
\hat{\alpha}_1 &= \hat{\beta}_1 \\
\hat{\alpha}_2 &= -\hat{\beta}_1 \\
\hat{\gamma}_1 &= \hat{\beta}_2 \\
\hat{\gamma}_2 &= \hat{\beta}_3 \\
\hat{\gamma}_3 &= -(\hat{\beta}_2 + \hat{\beta}_3).
\end{aligned} \tag{8.11}
$$
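The no-interaction model can be checked numerically. The sketch below relies on the standard closed form for the fitted counts under independence, $\hat{m}_{jk} = y_{j\cdot}\, y_{\cdot k} / n$, which is not derived in the text above, and then recovers $\mu$, $\alpha_j$, $\gamma_k$ under the sum-to-zero constraints. The $(2 \times 3)$ table and the function names are hypothetical.

```python
import math


def fit_independence(table):
    """Fitted counts m_jk = y_j. * y_.k / n under the no-interaction model."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    return [[r * c / n for c in col_tot] for r in row_tot]


def effects(fitted):
    """Recover mu, alpha_j, gamma_k from log m_jk (sum-to-zero constraints)."""
    logs = [[math.log(m) for m in row] for row in fitted]
    J, K = len(logs), len(logs[0])
    mu = sum(sum(row) for row in logs) / (J * K)
    alpha = [sum(row) / K - mu for row in logs]
    gamma = [sum(logs[j][k] for j in range(J)) / J - mu for k in range(K)]
    return mu, alpha, gamma


# Hypothetical (2 x 3) table of counts.
table = [[10, 20, 30], [20, 25, 15]]
m_hat = fit_independence(table)
mu, alpha, gamma = effects(m_hat)

# beta = (beta_0, beta_1, beta_2, beta_3) as in (8.11), and the coded matrix.
beta = [mu, alpha[0], gamma[0], gamma[1]]
X = [[1, 1, 1, 0], [1, 1, 0, 1], [1, 1, -1, -1],
     [1, -1, 1, 0], [1, -1, 0, 1], [1, -1, -1, -1]]
log_m = [sum(x * b for x, b in zip(row, beta)) for row in X]
```

Exponentiating `log_m` gives back the fitted counts cell by cell, which confirms that the coded design matrix and (8.11) describe the same model.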


2. Model with interactions

In two-way tables the interactions between the two variables are of interest. This corresponds to the general (full) model

$$
\log m_{jk} = \mu + \alpha_j + \gamma_k + (\alpha\gamma)_{jk}, \quad j = 1, \dots, J,\; k = 1, \dots, K, \tag{8.12}
$$

where, in addition, we have the $J + K$ restrictions

$$
\begin{aligned}
\sum_{k=1}^{K} (\alpha\gamma)_{jk} &= 0, \quad \text{for } j = 1, \dots, J, \\
\sum_{j=1}^{J} (\alpha\gamma)_{jk} &= 0, \quad \text{for } k = 1, \dots, K.
\end{aligned} \tag{8.13}
$$

As in the ANOVA model, the interactions may be coded by adding $(J - 1)(K - 1)$ columns to $\mathcal{X}$, obtained by the product of the corresponding coded variables. In our example for the $(2 \times 3)$ table the design matrix $\mathcal{X}$ is completed with two more columns:

$$
\mathcal{X} = \begin{pmatrix}
1 & 1 & 1 & 0 & 1 & 0 \\
1 & 1 & 0 & 1 & 0 & 1 \\
1 & 1 & -1 & -1 & -1 & -1 \\
1 & -1 & 1 & 0 & -1 & 0 \\
1 & -1 & 0 & 1 & 0 & -1 \\
1 & -1 & -1 & -1 & 1 & 1
\end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \\ \beta_5 \end{pmatrix}.
$$

Now the interactions are determined by using (8.13):

$$
\begin{aligned}
(\alpha\gamma)_{11} &= \beta_4 \\
(\alpha\gamma)_{12} &= \beta_5 \\
(\alpha\gamma)_{13} &= -\{(\alpha\gamma)_{11} + (\alpha\gamma)_{12}\} = -(\beta_4 + \beta_5) \\
(\alpha\gamma)_{21} &= -(\alpha\gamma)_{11} = -\beta_4 \\
(\alpha\gamma)_{22} &= -(\alpha\gamma)_{12} = -\beta_5 \\
(\alpha\gamma)_{23} &= -(\alpha\gamma)_{13} = \beta_4 + \beta_5.
\end{aligned}
$$

We again have a log-linear model as in (8.9), and the estimation of $\beta$ goes through the maximisation in $\beta$ of $L(\beta)$ given by (8.10) under the same constraint.

The model with all the interaction terms is called the saturated model. In this model there are no degrees of freedom: the number of free parameters to be estimated equals the number of free cells. The parameters of interest are the interactions. In particular, we are interested in testing their significance. These issues will be addressed below.
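The construction of the interaction columns can be sketched as follows. The example builds the two extra columns as products of the coded row and column variables and, using the fact that the saturated model reproduces the observed counts, recovers all effects from the logarithms of a hypothetical $(2 \times 3)$ table. Function and variable names are assumptions made for illustration.

```python
import math


def saturated_effects(table):
    """Decompose log y_jk into mu, alpha_j, gamma_k and (alpha gamma)_jk
    under the sum-to-zero constraints of (8.13)."""
    logs = [[math.log(y) for y in row] for row in table]
    J, K = len(logs), len(logs[0])
    mu = sum(sum(row) for row in logs) / (J * K)
    alpha = [sum(row) / K - mu for row in logs]
    gamma = [sum(logs[j][k] for j in range(J)) / J - mu for k in range(K)]
    inter = [[logs[j][k] - mu - alpha[j] - gamma[k] for k in range(K)]
             for j in range(J)]
    return mu, alpha, gamma, inter


# Coded variables for the six cells (11, 12, 13, 21, 22, 23).
row_code = [1, 1, 1, -1, -1, -1]
col_code1 = [1, 0, -1, 1, 0, -1]
col_code2 = [0, 1, -1, 0, 1, -1]

# Interaction columns are element-by-element products of coded columns.
X = [[1, a, g1, g2, a * g1, a * g2]
     for a, g1, g2 in zip(row_code, col_code1, col_code2)]

# Hypothetical (2 x 3) table of counts.
table = [[10, 20, 30], [20, 25, 15]]
mu, alpha, gamma, inter = saturated_effects(table)
beta = [mu, alpha[0], gamma[0], gamma[1], inter[0][0], inter[0][1]]
log_m = [sum(x * b for x, b in zip(row, beta)) for row in X]
```

Here `X` coincides with the six-column matrix displayed above, and `log_m` reproduces the logarithm of every observed count, as expected for a model with no degrees of freedom.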
Three-Way Tables

The models presented above for two-way tables can be extended to higher-order tables, but at a cost of notational complexity. We show how to adapt them to three-way tables. This deserves special attention due to the presence of higher-order interactions in the saturated model.

A $(J \times K \times L)$ three-way table may be constructed under multinomial sampling as follows: each of the n observations falls in one, and only one, category of each of three categorical variables having J, K and L modalities respectively. We end up with a three-dimensional table with JKL cells containing the counts $y_{jk\ell}$, where $n = \sum_{j,k,\ell} y_{jk\ell}$. The expected counts depend on the unknown probabilities $p_{jk\ell}$ in the usual way:

$$
m_{jk\ell} = n p_{jk\ell}, \quad j = 1, \dots, J,\; k = 1, \dots, K,\; \ell = 1, \dots, L.
$$


1. The saturated model

A full saturated log-linear model reads as follows:

$$
\log m_{jk\ell} = \mu + \alpha_j + \beta_k + \gamma_\ell + (\alpha\beta)_{jk} + (\alpha\gamma)_{j\ell} + (\beta\gamma)_{k\ell} + (\alpha\beta\gamma)_{jk\ell},
\quad j = 1, \dots, J,\; k = 1, \dots, K,\; \ell = 1, \dots, L. \tag{8.14}
$$

The restrictions are the following (using the “dot” notation for summation on the corresponding indices):

$$
\begin{aligned}
\alpha_{(\cdot)} = \beta_{(\cdot)} = \gamma_{(\cdot)} &= 0 \\
(\alpha\beta)_{j\cdot} = (\alpha\gamma)_{j\cdot} = (\beta\gamma)_{k\cdot} &= 0 \\
(\alpha\beta)_{\cdot k} = (\alpha\gamma)_{\cdot \ell} = (\beta\gamma)_{\cdot \ell} &= 0 \\
(\alpha\beta\gamma)_{jk\cdot} = (\alpha\beta\gamma)_{j\cdot\ell} = (\alpha\beta\gamma)_{\cdot k\ell} &= 0.
\end{aligned}
$$

The parameters $(\alpha\beta)_{jk}$, $(\alpha\gamma)_{j\ell}$, $(\beta\gamma)_{k\ell}$ are called first-order interactions. The second-order interactions are the parameters $(\alpha\beta\gamma)_{jk\ell}$; they allow us to take into account heterogeneities in the interactions between two of the three variables. For instance, let $\ell$ stand for the two gender categories ($L = 2$). If we suppose that $(\alpha\beta\gamma)_{jk1} = -(\alpha\beta\gamma)_{jk2} \neq 0$, we mean that the interactions between the variables J and K are not the same for both gender categories.

The estimates of the parameters of the saturated model are obtained through maximisation of the log-likelihood. In the multinomial sampling scheme, this corresponds to maximising the function:

$$
L = \sum_{j,k,\ell} y_{jk\ell} \log m_{jk\ell},
$$

under the constraint $\sum_{j,k,\ell} m_{jk\ell} = n$.

The number of degrees of freedom in the saturated model is again zero. Indeed, the number of free parameters in the model is

$$
1 + (J - 1) + (K - 1) + (L - 1) + (J - 1)(K - 1) + (J - 1)(L - 1) + (K - 1)(L - 1) + (J - 1)(K - 1)(L - 1) - 1 = JKL - 1.
$$

This is indeed equal to the number of free cells in the table, so there are no degrees of freedom.
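The parameter count above is easy to verify for any table size. The short sketch below simply evaluates the sum term by term; the function name is hypothetical.

```python
def saturated_free_parameters(J, K, L):
    """Free parameters of the saturated three-way log-linear model."""
    main = (J - 1) + (K - 1) + (L - 1)
    first_order = (J - 1) * (K - 1) + (J - 1) * (L - 1) + (K - 1) * (L - 1)
    second_order = (J - 1) * (K - 1) * (L - 1)
    # +1 for the constant mu, -1 for the constraint sum of m_jkl = n.
    return 1 + main + first_order + second_order - 1


# A (2 x 2 x 5) table has 2 * 2 * 5 - 1 = 19 free cells.
n_params = saturated_free_parameters(2, 2, 5)
free_cells = 2 * 2 * 5 - 1
degrees_of_freedom = free_cells - n_params
```

For a $(2 \times 2 \times 5)$ table this gives 19 free parameters for 19 free cells, hence zero degrees of freedom.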

2. Hierarchical non-saturated models

As illustrated above, a saturated model has no degrees of freedom. Non-saturated models correspond to reduced models where some parameters are fixed to be equal to zero. They are thus particular cases of the saturated model (8.14). The hierarchical non-saturated models that we will consider here are models where, once a set of parameters is set equal to zero, all the parameters of higher order containing the same indices are also set equal to zero.

For instance, if we suppose $\alpha_1 = 0$, we only consider non-saturated models where also $(\alpha\gamma)_{1\ell} = (\alpha\beta)_{1k} = (\alpha\beta\gamma)_{1k\ell} = 0$ for all values of k and $\ell$. If we only suppose that $(\alpha\beta)_{12} = 0$, we also assume that $(\alpha\beta\gamma)_{12\ell} = 0$ for all $\ell$.

Hierarchical models have the advantage of being more easily interpretable. Indeed, without this hierarchy, the models would be difficult to interpret. What would be, for instance, the meaning of the parameter $(\alpha\beta\gamma)_{12\ell}$ if we know that $(\alpha\beta)_{12} = 0$? The estimation of the non-saturated models is achieved in the usual way, i.e. by maximising the log-likelihood function L as above but under the new constraints of the reduced model.
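The hierarchy principle can be expressed as a simple check at the level of whole terms: a model is hierarchical if, whenever a term is included, every lower-order term built from a subset of its factors is included too. The sketch below is an illustration only; the factor labels "A", "B", "C" and the function name are assumptions, and the check works on terms rather than on individual parameters.

```python
from itertools import combinations


def is_hierarchical(terms):
    """terms: iterable of strings, each listing the factors of an included
    term, e.g. 'AB' for the first-order interaction between A and B."""
    included = {frozenset(t) for t in terms}
    for term in included:
        for size in range(1, len(term)):
            for sub in combinations(sorted(term), size):
                if frozenset(sub) not in included:
                    return False
    return True


# Main effects plus one first-order interaction: hierarchical.
ok_model = is_hierarchical(["A", "B", "C", "AB"])

# Second-order interaction kept while all first-order ones are dropped.
bad_model = is_hierarchical(["A", "B", "C", "ABC"])
```

The second model is the kind that is hard to interpret: it keeps $(\alpha\beta\gamma)$ while setting $(\alpha\beta)$ to zero.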


8.2.3  Testing  Issues  with  Count  Data 

One of the main practical interests in regression models for contingency tables is to test restrictions on the parameters of a more complete model. These testing ideas follow the same spirit as in Sect. 3.5, where we tested restrictions in ANOVA models.

In linear models, the test statistic is based on the comparison of the goodness of fit for the full model and for the reduced model. Goodness of fit is measured by the residual sum of squares (RSS). The idea will be the same here, but with a more appropriate measure of goodness of fit. Once a model has been estimated, we can compute the predicted value under that model for each cell of the table. We will denote, as above, the observed value in a cell by $y_k$, and $\hat{m}_k$ will denote the expected value predicted by the model. The goodness of fit may be assessed by measuring, in some way, the distance between the series of observed and of predicted values.



Two statistics are proposed: the Pearson chi-square $X^2$ and the deviance, denoted $G^2$. They are defined as follows:

$$
X^2 = \sum_{k=1}^{K} \frac{(y_k - \hat{m}_k)^2}{\hat{m}_k} \tag{8.15}
$$

$$
G^2 = 2 \sum_{k=1}^{K} y_k \log\left( \frac{y_k}{\hat{m}_k} \right) \tag{8.16}
$$

where K is the total number of cells of the table. The deviance is directly related to the log-likelihood ratio statistic and is usually preferred because it can be used to compare nested models, as we usually do in this context.

Under the hypothesis that the model used to compute the predicted value is true, both statistics (for large samples) are approximately distributed as a $\chi^2$ variable with degrees of freedom d.f. depending on the model. The d.f. can be computed as follows:

$$
\text{d.f.} = \#\,\text{free cells} - \#\,\text{free parameters estimated}. \tag{8.17}
$$

For saturated models, the fit is perfect: $X^2 = G^2 = 0$ with d.f. $= 0$.

Suppose now that we want to test a reduced model which is a restricted version of a full model. The deviance can then be used in the same way as the F statistic in linear regression. The test procedure is straightforward:

$$
\begin{aligned}
H_0 &: \text{reduced model with } r \text{ degrees of freedom} \\
H_1 &: \text{full model with } f \text{ degrees of freedom}.
\end{aligned} \tag{8.18}
$$

Since the full model contains more parameters, we expect its deviance to be smaller. We reject $H_0$ if this reduction is significant, i.e. if $G^2_{H_0} - G^2_{H_1}$ is large enough. Under $H_0$ one has:

$$
G^2_{H_0} - G^2_{H_1} \sim \chi^2_{r-f}.
$$

We reject $H_0$ if the p-value

$$
P\{\chi^2_{r-f} > (G^2_{H_0} - G^2_{H_1})\}
$$

is small. Suppose we want to test the independence in a $(J \times K)$ two-way table (no interaction). Here the full model is the saturated one with no degrees of freedom ($f = 0$) and the restricted model has $r = (J - 1)(K - 1)$ degrees of freedom. We reject $H_0$ if the p-value of $H_0$, $P\{\chi^2_r > G^2_{H_0}\}$, is too small.

This test is asymptotically equivalent to the Pearson chi-square test for independence in two-way tables ($G^2_{H_0} \approx X^2_{H_0}$ when n is large).
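The two statistics (8.15) and (8.16) can be computed side by side for the independence model. The sketch below uses a hypothetical $(2 \times 3)$ table and the closed-form fitted counts $\hat{m}_{jk} = y_{j\cdot}\, y_{\cdot k} / n$; the function names are assumptions made for this example.

```python
import math


def independence_fit(table):
    """Fitted counts under independence: row total * column total / n."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    return [[r * c / n for c in col_tot] for r in row_tot]


def pearson_x2(observed, fitted):
    """Pearson chi-square statistic (8.15), summed over all cells."""
    return sum((y - m) ** 2 / m
               for yr, mr in zip(observed, fitted) for y, m in zip(yr, mr))


def deviance_g2(observed, fitted):
    """Deviance (8.16); empty cells contribute zero."""
    return 2.0 * sum(y * math.log(y / m)
                     for yr, mr in zip(observed, fitted)
                     for y, m in zip(yr, mr) if y > 0)


# Hypothetical (2 x 3) table of counts.
table = [[20, 30, 50], [30, 30, 40]]
fitted = independence_fit(table)
x2 = pearson_x2(table, fitted)
g2 = deviance_g2(table, fitted)
df = (len(table) - 1) * (len(table[0]) - 1)  # (J - 1)(K - 1)
```

For this table the two statistics are close to each other, in line with the large-sample equivalence noted above, and both are compared with a $\chi^2$ distribution with 2 degrees of freedom.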
Table 8.6 A three-way contingency table: top table for men and bottom table for women Q MVAdrug

M     A1    A2    A3    A4    A5
DY    21    32    70    43    19
DN    683   596   705   295   99

F     A1    A2    A3    A4    A5
DY    46    89    169   98    51
DN    738   700   847   336   196

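Table 8.7 Coefficient estimates based on the saturated model Q MVAdrug

                                   $\hat{\beta}$                                            $\hat{\beta}$
$\hat{\beta}_0$  intercept          5.0089     $\hat{\beta}_{10}$                           0.0205
$\hat{\beta}_1$  gender: M         -0.2867     $\hat{\beta}_{11}$                           0.0482
$\hat{\beta}_2$  drug: DY          -1.0660     $\hat{\beta}_{12}$  drug*age                -0.4983
$\hat{\beta}_3$  age               -0.0080     $\hat{\beta}_{13}$                          -0.1807
$\hat{\beta}_4$                     0.2151     $\hat{\beta}_{14}$                           0.0857
$\hat{\beta}_5$                     0.6607     $\hat{\beta}_{15}$                           0.2766
$\hat{\beta}_6$                    -0.0463     $\hat{\beta}_{16}$  gender*drug*age         -0.0134
$\hat{\beta}_7$  gender*drug       -0.1632     $\hat{\beta}_{17}$                          -0.0523
$\hat{\beta}_8$  gender*age         0.0713     $\hat{\beta}_{18}$                          -0.0112
$\hat{\beta}_9$                    -0.0092     $\hat{\beta}_{19}$                          -0.0102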

Example 8.4 Everitt and Dunn (1998) provide a three-dimensional $(2 \times 2 \times 5)$ count table of $n = 5{,}833$ interviewed people. The counts concern the prescription of psychotropic drugs in the fortnight prior to the interview, as a function of age and gender. The data are summarised in Table 8.6, where the categories for the three factors are M for male, F for female, DY for “yes” having taken drugs, DN for “no” not having taken drugs and the five age categories: A1 (16-29), A2 (30-44), A3 (45-64), A4 (65-74), A5 for over 74. The table provides the observed frequencies $y_{jk\ell}$ in each of the cells of the three-way table, where j stands for gender, k for drug and $\ell$ for age categories. The design matrix $\mathcal{X}$ for the full saturated model can be found in the quantlet Q MVAdrug.

The saturated model gives the estimates displayed in Table 8.7.

We see for instance that $\hat{\beta}_1 < 0$, so there are fewer men than women in the study. Since $\hat{\beta}_7$ is also negative, it seems that the tendency of men to take the drug is less important than for women. Also, note that $\hat{\beta}_{12}$ to $\hat{\beta}_{15}$ form an increasing sequence, so that the age factor seems to increase the tendency to take the drug. Note that in this saturated model there are no degrees of freedom and the fit is perfect: $\hat{m}_{jk\ell} = y_{jk\ell}$ for all the cells of the table.

The second-order interactions have a lower order of magnitude, so we want to test whether they are significantly different from zero. We consider a restricted model where the $(\alpha\beta\gamma)_{jk\ell}$ are all set to zero. This can be achieved by testing $H_0: \beta_{16} = \beta_{17} = \beta_{18} = \beta_{19} = 0$. The maximum likelihood estimators of the restricted model are obtained by deleting the last four columns of the design matrix $\mathcal{X}$. The results are given in Table 8.8.

Table 8.8 Coefficient estimates based on the maximum likelihood method Q MVAdrug3waysTab

                                   $\hat{\beta}$                                       $\hat{\beta}$
$\hat{\beta}_0$  intercept          5.0051     $\hat{\beta}_8$  gender*age              0.0795
$\hat{\beta}_1$  gender: M         -0.2919     $\hat{\beta}_9$                          0.0321
$\hat{\beta}_2$  drug: DY          -1.0717     $\hat{\beta}_{10}$                       0.0265
$\hat{\beta}_3$  age               -0.0030     $\hat{\beta}_{11}$                       0.0534
$\hat{\beta}_4$                     0.2358     $\hat{\beta}_{12}$  drug*age            -0.4915
$\hat{\beta}_5$                     0.6649     $\hat{\beta}_{13}$                      -0.1576
$\hat{\beta}_6$                    -0.0425     $\hat{\beta}_{14}$                       0.0917
$\hat{\beta}_7$  gender*drug       -0.1734     $\hat{\beta}_{15}$                       0.2822

We have $J = 2$, $K = 2$ and $L = 5$, which makes $JKL - 1 = 19$ free cells. The full model has $f = 0$ degrees of freedom and the reduced model has $r = 4$ degrees of freedom. The $G^2$ deviance is 2.3004 with 4 degrees of freedom (the chi-square statistic is 2.3745). The p-value of the restricted model is 0.6807, so we do not reject the null hypothesis (the restricted model without second-order interaction). In other words, age does not interfere with the interactions between gender and drugs, or equivalently, gender does not interfere with the interactions between age and drugs. The reader can verify that the first-order interactions are significant by taking, for instance, the model without second-order interactions as the new full model and testing a reduced model where all the first-order interactions are set to zero.
Q MVAdrug3waysTab
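The p-value quoted in Example 8.4 can be reproduced without any statistical library, because for an even number of degrees of freedom the upper tail of the $\chi^2$ distribution has the closed form $P(\chi^2_{2m} > x) = e^{-x/2} \sum_{i=0}^{m-1} (x/2)^i / i!$. The sketch below is restricted to that even case; the function name is hypothetical.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper tail P(chi2_df > x), valid only for even df.

    Uses exp(-x/2) * sum_{i < df/2} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i)
                                 for i in range(df // 2))


# Deviance of the restricted model in Example 8.4: G^2 = 2.3004 with r = 4.
p_value = chi2_sf_even_df(2.3004, 4)
```

The result rounds to 0.6807, which agrees with the value reported in the example.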


8.2.4  Logit  Models 

Logit models are useful to analyse how explanatory variables influence a binary response y. The response y may take the two values 1 and 0 to denote the presence or absence of a certain qualitative trait (a person can be employed or unemployed, a firm can be bankrupt or not, a patient can be affected by a certain disease or not, etc.). Logit models are designed to estimate the probability of $y = 1$ as a logistic function of linear combinations of x. Logit models can be adapted to the analysis of contingency tables where one of the qualitative variables is binary. One obtains the probability of being in one of the two states of this binary variable as a function of the other variables. We concentrate here on $(2 \times K)$ and $(2 \times K \times L)$ tables.

Logit Models for Binary Response

Consider the vector y $(n \times 1)$ of observations on a binary response variable (a value of “1” indicating the presence of a particular qualitative trait and a value of “0” its absence). The logit model makes the assumption that the probability of observing $y_i = 1$ given a particular value of $x_i = (x_{i1}, \dots, x_{ip})^\top$ is given by the logistic function of a “score”, a linear combination of x:

$$
p(x_i) = P(y_i = 1 \mid x_i) = \frac{\exp(\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij})}{1 + \exp(\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij})}. \tag{8.19}
$$

This entails the probability of the absence of the trait:

$$
1 - p(x_i) = P(y_i = 0 \mid x_i) = \frac{1}{1 + \exp(\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij})},
$$

which implies

$$
\log\left\{ \frac{p(x_i)}{1 - p(x_i)} \right\} = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}. \tag{8.20}
$$

This indicates that the logit model is equivalent to a log-linear model for the odds $p(x_i)/\{1 - p(x_i)\}$. A positive value of $\beta_j$ indicates an explanatory variable $x_j$ that will favour the presence of the trait since it improves the odds. A zero value of $\beta_j$ corresponds to the absence of an effect of this variable on the appearance of the qualitative trait.

For i.i.d. observations the likelihood function is:

$$
L(\beta_0, \beta) = \prod_{i=1}^{n} p(x_i)^{y_i} \{1 - p(x_i)\}^{1 - y_i}.
$$

The maximum likelihood estimators of the $\beta$'s are obtained as the solution of the non-linear maximisation problem $(\hat{\beta}_0, \hat{\beta}) = \arg\max_{\beta_0, \beta} \log L(\beta_0, \beta)$, where

$$
\log L(\beta_0, \beta) = \sum_{i=1}^{n} \left[ y_i \log p(x_i) + (1 - y_i) \log\{1 - p(x_i)\} \right].
$$

The asymptotic theory of the MLE of Chap. 6 (see Theorem 6.3) applies and thus asymptotic inference on $\beta$ is available (tests of hypotheses or confidence intervals).
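The maximisation has no closed form, so an iterative method is needed. The sketch below uses Newton-Raphson for a model with an intercept and a single explanatory variable; the choice of algorithm, the toy data and all function names are assumptions made for illustration and are not taken from the book's quantlets.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def log_lik(b0, b1, xs, ys):
    """log L(beta_0, beta_1) for a binary response and one covariate."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def fit_logit(xs, ys, steps=25):
    """Newton-Raphson: beta <- beta + (X'WX)^{-1} X'(y - p), 2 x 2 case."""
    b0 = b1 = 0.0
    for _ in range(steps):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p            # score for the intercept
            g1 += (y - p) * x      # score for the slope
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical, non-separable toy data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
b0_hat, b1_hat = fit_logit(xs, ys)
```

At the solution the score equations are satisfied, so the fitted probabilities sum to the observed number of ones; the positive slope reflects that larger x favours the trait in this toy sample.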

Example 8.5 In the bankruptcy data set (see Sect. 22.22), we have measures on 5 financial characteristics on 66 banks, 33 among them being bankrupt and the other 33 still being solvent. The logit model can be used to evaluate the probability of bankruptcy as a function of these financial ratios. We obtain the results summarised in Table 8.9. We observe that only $\beta_3$ and $\beta_4$ are significant.

Table 8.9 Probabilities of the bankruptcies with the logit model Q MVAbankrupt

             $\hat{\beta}$   p-values
$\beta_0$     3.6042         0.0660
$\beta_3$    -0.2031         0.0037
$\beta_4$    -0.0205         0.0183

Table 8.10 A $(2 \times K)$ contingency table

         1                ...   k                ...   K                Total
1        $y_{11}$         ...   $y_{1k}$         ...   $y_{1K}$         $y_{1\cdot}$
2        $y_{21}$         ...   $y_{2k}$         ...   $y_{2K}$         $y_{2\cdot}$
Total    $y_{\cdot 1}$    ...   $y_{\cdot k}$    ...   $y_{\cdot K}$    $y_{\cdot\cdot} = n$


Logit  Models  for  Contingency  Tables 

The logit model may contain quantitative and qualitative explanatory variables. In the latter case, the variable may be coded according to the rules described in the ANOVA/ANCOVA sections above. This allows us to revisit the contingency tables where one of the variables is binary and is the variable of interest. How can the probability of taking one of the two nominal values be evaluated as a function of the other variables? We keep the notation of Sect. 8.1 and suppose, without loss of generality, that the first variable with $J = 2$ is the binary variable of interest. In the drug Example 8.4, we have a $(2 \times 2 \times 5)$ table and one is interested in the probability of taking a drug as a function of age and gender.


(2  x  K)  Tables  with  Binomial  Sampling 

Table 8.10 displays the situation. Let $p_k$ be the probability of falling into the first row for the k-th column, $k = 1, \dots, K$. Since we are mainly interested in the probabilities $p_k$ as a function of k, we suppose here that the $y_{\cdot k}$ are fixed for $k = 1, \dots, K$ (or we work conditionally on the observed value of these column totals), where $y_{\cdot k} = \sum_{j=1}^{2} y_{jk}$. Therefore, we have K independent binomial processes with parameters $(y_{\cdot k}, p_k)$. Since the column variable is nominal, we can use an ANOVA model to analyse the effect of the column variable on $p_k$ through the logs of the odds

$$
\log\left( \frac{p_k}{1 - p_k} \right) = \eta_0 + \eta_k, \quad k = 1, \dots, K, \tag{8.21}
$$

where $\sum_{k=1}^{K} \eta_k = 0$. As in the ANOVA models, one of the interests will be to test $H_0: \eta_1 = \dots = \eta_K = 0$. The log-linear model for the odds has its equivalent in a logit formulation for $p_k$:

$$
p_k = \frac{\exp(\eta_0 + \eta_k)}{1 + \exp(\eta_0 + \eta_k)}, \quad k = 1, \dots, K. \tag{8.22}
$$

Note that we can code the RHS of (8.21) as a linear model $\mathcal{X}\theta$, where for instance, for a $(2 \times 4)$ table ($K = 4$) we have:

$$
\mathcal{X} = \begin{pmatrix}
1 & 1 & 0 & 0 \\
1 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 \\
1 & -1 & -1 & -1
\end{pmatrix}, \quad
\theta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix},
$$

where $\eta_0 = \beta_0$, $\eta_1 = \beta_1$, $\eta_2 = \beta_2$, $\eta_3 = \beta_3$ and $\eta_4 = -(\beta_1 + \beta_2 + \beta_3)$. The logit model for $p_k$, $k = 1, \dots, K$, can now be written, with some abuse of notation, as the K-vector

$$
p = \frac{\exp(\mathcal{X}\theta)}{1 + \exp(\mathcal{X}\theta)},
$$

where the division has to be understood as being element-by-element. The MLE of $\theta$ is obtained by maximising the log-likelihood

$$
L(\theta) = \sum_{k=1}^{K} \{ y_{1k} \log p_k + y_{2k} \log(1 - p_k) \}, \tag{8.23}
$$

where the $p_k$ are elements of the K-vector p.

This logit model is a saturated model. Indeed, the number of free parameters is K, the dimension of $\theta$, and the number of free cells is also equal to K since we consider the column totals $y_{\cdot k}$ as being fixed. So there are no degrees of freedom in this model. It can be proven that this logit model is equivalent to the saturated model for a $(2 \times K)$ table presented in Sect. 8.2.2, where all the interactions are present in the model. The hypothesis of all interactions $(\alpha\gamma)_{jk}$ being equal to zero (independence case) is equivalent to the hypothesis that the $\eta_k$, $k = 1, \dots, K$, are all equal to zero (no column effect on the probabilities $p_k$).
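Because the saturated logit model fits each column exactly, its estimates can be written down directly: $\hat{p}_k = y_{1k} / y_{\cdot k}$, $\hat{\eta}_0$ is the average of the observed log-odds and $\hat{\eta}_k$ is the deviation of column k from that average. The sketch below illustrates this with the men's counts of Table 8.6 treated as a $(2 \times 5)$ table (drug by age); using that sub-table here, and the function name, are choices made for this example.

```python
import math


def saturated_logit(row1, row2):
    """eta_0 and eta_k of (8.21) from the two rows of a (2 x K) table."""
    logits = [math.log(a / b) for a, b in zip(row1, row2)]
    eta0 = sum(logits) / len(logits)
    eta = [v - eta0 for v in logits]  # sums to zero by construction
    return eta0, eta


# Men in Table 8.6: drug "yes" (DY) and "no" (DN) by age group A1..A5.
dy = [21, 32, 70, 43, 19]
dn = [683, 596, 705, 295, 99]
eta0, eta = saturated_logit(dy, dn)

# Plugging the estimates into (8.22) returns the observed proportions.
p = [1.0 / (1.0 + math.exp(-(eta0 + e))) for e in eta]
```

The fitted probabilities equal the observed shares of drug takers in each age group, which is the sense in which the model has no degrees of freedom.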

The main interest of the logit presentation is its flexibility when the variable defining the column categories is a quantitative variable (age group, number of children, etc.). Indeed, when this is the case, the logit model allows us to quantify the effect of the column category using fewer parameters and a more flexible relationship than a linear relation. Suppose that we could attach a representative value $x_k$ to each column category (for instance, it could be the median value, or the average value of the class category). We can then choose the following logit model for $p_k$:

$$
p_k = \frac{\exp(\eta_0 + \eta_1 x_k)}{1 + \exp(\eta_0 + \eta_1 x_k)}, \quad k = 1, \dots, K, \tag{8.24}
$$

where we now have only two free parameters for K free cells, so we have $K - 2$ degrees of freedom. We could even introduce a quadratic term to allow some curvature effect of x on the odds

$$
p_k = \frac{\exp(\eta_0 + \eta_1 x_k + \eta_2 x_k^2)}{1 + \exp(\eta_0 + \eta_1 x_k + \eta_2 x_k^2)}, \quad k = 1, \dots, K.
$$

In this latter case, we would still have $K - 3$ degrees of freedom.

We can follow the same idea for a three-way table when we want to model the behaviour of the first binary variable as a function of the two other variables defining the table. In the drug example, one is interested in analysing the tendency of taking a psychotropic drug as a function of the gender category and of the age. Fix the number of observations in each cell $k\ell$ (i.e. $y_{\cdot k\ell}$), so that we have a binomial sampling process with an unknown parameter $p_{k\ell}$ for each cell. As in the two-way case above, we can use either ANOVA-like models for the logarithm of the odds or ANCOVA-like models when one (or both) of the two qualitative variables defining the K and/or L categories is a quantitative variable.

One may study the following ANOVA model for the logarithms of the odds

$$
\log\left( \frac{p_{k\ell}}{1 - p_{k\ell}} \right) = \mu + \eta_k + \zeta_\ell, \quad k = 1, \dots, K,\; \ell = 1, \dots, L,
$$

with $\eta_{\cdot} = \zeta_{\cdot} = 0$. As another example, if $x_\ell$ is a representative value (like the average age of the group) of the $\ell$th level of the third categorical variable, one might think of:

$$
\log\left( \frac{p_{k\ell}}{1 - p_{k\ell}} \right) = \mu + \eta_k + \zeta x_\ell, \quad k = 1, \dots, K,\; \ell = 1, \dots, L, \tag{8.25}
$$

with the constraint $\eta_{\cdot} = 0$. Here also, interactions and the curvature effect for $x_\ell$ can be introduced, as shown in the following example. Since the cell totals $y_{\cdot k\ell}$ are considered as fixed, the log-likelihood to be maximised is:

$$
L = \sum_{k=1}^{K} \sum_{\ell=1}^{L} \{ y_{1k\ell} \log p_{k\ell} + y_{2k\ell} \log(1 - p_{k\ell}) \}, \tag{8.26}
$$

where $p_{k\ell}$ follows the appropriate logistic model.
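A minimal sketch of how (8.26) might be maximised for a model of type (8.25) is given below. It uses Newton-Raphson on grouped binomial data, with the counts of Table 8.6 and the average ages that appear in the design matrix of Example 8.6. The algorithm and the function names are assumptions made for illustration; the book's own computation is in the quantlet referenced there.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def prob(row, theta):
    return 1.0 / (1.0 + math.exp(-sum(a * t for a, t in zip(row, theta))))


def fit_binomial_logit(X, successes, totals, iterations=25):
    """Newton-Raphson for grouped data: score X'(y - n p), info X'WX."""
    q = len(X[0])
    theta = [0.0] * q
    for _ in range(iterations):
        score = [0.0] * q
        info = [[0.0] * q for _ in range(q)]
        for row, y, n in zip(X, successes, totals):
            p = prob(row, theta)
            for a in range(q):
                score[a] += row[a] * (y - n * p)
                for b in range(q):
                    info[a][b] += row[a] * row[b] * n * p * (1.0 - p)
        theta = [t + s for t, s in zip(theta, solve(info, score))]
    return theta


def deviance(X, successes, totals, theta):
    """G^2 of (8.16) summed over the 'yes' and 'no' cells."""
    g2 = 0.0
    for row, y, n in zip(X, successes, totals):
        p = prob(row, theta)
        g2 += 2.0 * (y * math.log(y / (n * p))
                     + (n - y) * math.log((n - y) / (n * (1.0 - p))))
    return g2


ages = [23.2, 36.5, 54.3, 69.2, 79.5]
X = [[1.0, 1.0, a] for a in ages] + [[1.0, -1.0, a] for a in ages]
drug_yes = [21, 32, 70, 43, 19, 46, 89, 169, 98, 51]       # DY, men then women
drug_no = [683, 596, 705, 295, 99, 738, 700, 847, 336, 196]  # DN
totals = [y + n for y, n in zip(drug_yes, drug_no)]

theta_hat = fit_binomial_logit(X, drug_yes, totals)
g2 = deviance(X, drug_yes, totals, theta_hat)
```

The resulting estimates can be compared with the values reported in Example 8.6; small differences in the last digits would not be surprising, since the optimiser used there may differ.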



Example 8.6 Consider again Example 8.4. One is interested in the influence of
gender and age on drug prescription. Take the number of observations for each
“gender-age group” combination as fixed. A logit model (8.25) can be used for
the odds-ratios of the probability of taking drugs, where the value $x_l$ is the average
age of the group. In the linear form it may be written as one of the two following
equivalent forms:

$$\log\left(\frac{p}{1 - p}\right) = \mathcal{X}\theta, \qquad p = \frac{\exp(\mathcal{X}\theta)}{1 + \exp(\mathcal{X}\theta)},$$

where $\theta = (\beta_0, \beta_1, \beta_2)^{\top}$ and the design matrix $\mathcal{X}$ is given by

$$\mathcal{X} = \begin{pmatrix}
1.0 & 1.0 & 23.2 \\
1.0 & 1.0 & 36.5 \\
1.0 & 1.0 & 54.3 \\
1.0 & 1.0 & 69.2 \\
1.0 & 1.0 & 79.5 \\
1.0 & -1.0 & 23.2 \\
1.0 & -1.0 & 36.5 \\
1.0 & -1.0 & 54.3 \\
1.0 & -1.0 & 69.2 \\
1.0 & -1.0 & 79.5
\end{pmatrix}.$$

The first column of $\mathcal{X}$ is for the intercept, the second is the coded variable for
the two gender categories and the last column is the average of the ages for the
corresponding age-group. Then we estimate $\beta$ by maximising the log-likelihood
function (8.26). We obtain:

$$\hat\beta_0 = -3.5612, \quad \hat\beta_1 = -0.3426, \quad \hat\beta_2 = 0.0280,$$

the intercept for men is $\hat\beta_0 + \hat\beta_1 = -3.9038$ and for women is $\hat\beta_0 - \hat\beta_1 = -3.2186$,
indicating a gender effect, and the common slope for the positive age effect is
$\hat\beta_2 = 0.0280$. The fit appears to be reasonably good. There are $K \times L = 2 \times 5 = 10$
free cells in the table. A saturated “full” model with ten parameters and
zero degrees of freedom would involve a constant (one parameter) plus an effect
for gender (one parameter) plus an effect for age (four parameters) and finally the
interactions between gender and age (four parameters). The model retained above
is a “reduced model” with only three parameters that can be tested against the most
general saturated model. We obtain the value of the deviance $G^2_{H_0} = 11.5584$ with
7 degrees of freedom ($7 = 10 - 3$), whereas $G^2_{H_1} = 0$ with no degree of freedom.
This gives a $p$-value $= 0.1160$, so we cannot reject the reduced model.

Fig. 8.3 Fit of the log of the odds-ratios for taking drugs: linear model for age effect with a “gender”
effect (no interaction). Men are the stars and women are the circles Q MVAdruglogistic
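To make the interpretation of the three estimates concrete, the fitted log-odds and probabilities can be tabulated for each gender and age group. This is a small sketch using only the rounded estimates and average ages quoted in Example 8.6; the helper names `log_odds` and `prob` are illustrative, and results inherit the rounding of the reported coefficients.

```python
import math

# Rounded estimates reported in Example 8.6: (beta0, beta1, beta2).
beta = (-3.5612, -0.3426, 0.0280)
# Average ages of the five age groups (third column of the design matrix).
ages = (23.2, 36.5, 54.3, 69.2, 79.5)

def log_odds(gender_code, age):
    """Linear predictor beta0 + beta1*gender + beta2*age (+1 men, -1 women)."""
    return beta[0] + beta[1] * gender_code + beta[2] * age

def prob(eta):
    """Inverse logit: maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

intercept_men = beta[0] + beta[1]
intercept_women = beta[0] - beta[1]

# Fitted probabilities for every (gender code, average age) combination.
fitted = {(g, a): prob(log_odds(g, a)) for g in (1, -1) for a in ages}

for (g, a), p in sorted(fitted.items()):
    label = "men" if g == 1 else "women"
    print(f"{label:5s} age {a:5.1f}: log-odds {log_odds(g, a):7.4f}, p {p:.4f}")
```

With a common slope, the two fitted lines on the log-odds scale are parallel and differ only by the gender shift $2\hat\beta_1$.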

Figure  8.3  shows  how  well  the  model  fits  the  data.  It  displays  the  fitted  values  of 
the  log  of  the  odds-ratios  by  the  linear  model  for  the  men  and  the  women  along  with 
the  log  of  the  odds-ratios  computed  from  the  observed  corresponding  frequencies. 
It  seems  that  the  age  effect  shows  a  curvature.  So  we  fit  a  model  introducing  the 
square  of  the  ages.  This  gives  the  following  design  matrix: 


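$$\mathcal{X} = \begin{pmatrix}
1.0 & 1.0 & 23.2 & 538.24 \\
1.0 & 1.0 & 36.5 & 1332.25 \\
1.0 & 1.0 & 54.3 & 2948.49 \\
1.0 & 1.0 & 69.2 & 4788.64 \\
1.0 & 1.0 & 79.5 & 6320.25 \\
1.0 & -1.0 & 23.2 & 538.24 \\
1.0 & -1.0 & 36.5 & 1332.25 \\
1.0 & -1.0 & 54.3 & 2948.49 \\
1.0 & -1.0 & 69.2 & 4788.64 \\
1.0 & -1.0 & 79.5 & 6320.25
\end{pmatrix}$$
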
The maximum likelihood estimators are:

$$\hat\beta_0 = -4.4996, \quad \hat\beta_1 = -0.3457, \quad \hat\beta_2 = 0.0697, \quad \hat\beta_3 = -0.0004.$$

Q MVAdruglogistic

The fit is better for this more flexible alternative, giving a deviance $G^2_{H_1} = 3.3251$
with 6 degrees of freedom ($6 = 10 - 4$). If we test $H_0$: no curvature for
the age effect against $H_1$: curvature for the age effect, the reduction of the deviance
is $G^2_{H_0} - G^2_{H_1} = 11.5584 - 3.3251 = 8.2333$ with one degree of freedom. The
$p$-value $= 0.0041$, so we reject the reduced model (no curvature) in favour of the
more general model with a curvature term.
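The two $p$-values quoted in this example can be checked from the deviances alone, since each deviance difference is referred to a $\chi^2$ distribution. The sketch below assumes nothing beyond the deviances and degrees of freedom given in the text; the helper `chi2_sf` is a hypothetical name and uses the classical finite-sum formulas for integer degrees of freedom so that only the standard library is needed.

```python
import math

def chi2_sf(x, df):
    """Upper tail P(chi2_df > x) for integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r=0}^{df/2 - 1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= (x / 2.0) / r
            total += term
        return math.exp(-x / 2.0) * total
    # Odd df: two-sided normal tail plus a finite correction sum
    # 2*phi(sqrt(x)) * sum_{r=1}^{(df-1)/2} x^(r-1/2) / (1*3*...*(2r-1)).
    root = math.sqrt(x)
    tail = math.erfc(root / math.sqrt(2.0))
    pdf = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return tail + 2.0 * pdf * total

# Reduced (linear age effect) model against the saturated model: 7 df.
p_reduced = chi2_sf(11.5584, 7)
# Curvature test: reduction of the deviance with 1 df.
p_curvature = chi2_sf(11.5584 - 3.3251, 1)
print(round(p_reduced, 4), round(p_curvature, 4))
```

The first value leads to not rejecting the three-parameter model against the saturated one, while the second supports adding the squared age term.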

We  know  already  from  Example  8.4  that  second  order  interactions  are  not 
significant  for  this  data  set  (the  influence  of  age  on  taking  a  drug  is  the  same  for 
both  gender  categories),  so  we  can  keep  this  model  as  a  final  reasonable  model  to 
analyse  the  probability  of  taking  the  drug  as  a  function  of  the  gender  and  of  the 
age.  To  summarise  this  analysis  we  end  up  saying  that  the  probability  of  taking  a 
psychotropic  drug  can  be  modelled  as  (with  some  abuse  of  notation) 

$$\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 * \text{Sex} + \beta_2 * \text{Age} + \beta_3 * \text{Age}^2. \qquad (8.27)$$


Summary 

↪ In contingency tables, the categories are defined by the qualitative
variables.

↪ The saturated model has all of the interaction terms, and 0 degrees
of freedom.

↪ The non-saturated model is a reduced model since it fixes some
parameters to be zero.

↪ Two statistics to test for the full model and the reduced model are:

$$X^2 = \sum_{k=1}^{K} (y_k - m_k)^2 / m_k$$

$$G^2 = 2 \sum_{k=1}^{K} y_k \log(y_k / m_k)$$

↪ The logit models allow the column categories to be a quantitative
variable, and quantify the effect of the column category by using
fewer parameters and incorporating more flexible relationships
than just a linear one.

↪ The logit model is equivalent to a log-linear model.

$$\log\left[p(x_i) / \{1 - p(x_i)\}\right] = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}$$


8.3  Exercises 

Exercise 8.1 For the one factor ANOVA model, show that if the model is “balanced”
($n_1 = n_2 = n_3$), we have $\hat\mu = \bar y$. If the model is not balanced, show that
$\bar y = \hat\mu + n^{-1}(n_1 \hat\alpha_1 + n_2 \hat\alpha_2 + n_3 \hat\alpha_3)$.

Exercise  8.2  Redo  the  calculations  of  Example  8.2  and  test  if  the  main  effects  of 
the  marketing  strategy  and  of  the  location  are  significant. 

Exercise  8.3  Redo  the  calculations  of  Example  8.3  with  the  Car  Data  set. 

Exercise 8.4 Calculate the prediction interval for “classic blue” pullover sales
(Example 3.2) corresponding to price = 120.

Exercise 8.5 Redo the calculations of the Boston housing example in Sect. 8.1.3.

Exercise  8.6  We  want  to  analyse  the  variations  in  the  consumption  of  packs  of 
cigarettes  per  month  as  a  function  of  the  brand  (A  or  B),  of  the  price  per  pack 
and  as  a  function  of  the  gender  of  the  smoker  (M  or  F).  The  data  are  below. 
  y    Price   Gender   Brand
 30    3.5     M        A
  4    4       F        B
 20    4.1     F        B
 15    3.75    M        A
 24    3.25    F        A
 11    5       F        B
  8    4.1     F        B
  9    3.5     M        A
 17    4.5     M        B
  1    4       F        B
 23    3.65    M        A
 13    3.5     M        A

1.  In  addition  to  the  effects  of  brand,  price  and  gender,  test  if  there  is  an  interaction 
between  the  brand  and  the  price. 

2.  How  would  the  design  matrix  of  a  full  model  with  all  the  interactions  between 
the  variables  appear?  What  would  be  the  number  of  degrees  of  freedom  of  such 
a  model? 

3.  We  would  like  to  introduce  a  curvature  term  for  the  price  variable.  How  can  we 
proceed?  Test  if  this  coefficient  is  significant. 

Exercise 8.7 In the drug Example 8.4, test if the first order interactions are
significant.


Chapter  9 

Variable  Selection 


Variable  selection  is  very  important  in  statistical  modelling.  We  are  frequently 
not  only  interested  in  using  a  model  for  prediction  but  also  need  to  correctly 
identify  the  relevant  variables,  that  is,  to  recover  the  correct  model  under  given 
assumptions. It is known that under certain conditions, the ordinary least squares
(OLS) method produces poor prediction results and does not yield a parsimonious
model, which leads to overfitting. Therefore the objective of variable selection methods
is to find the variables which are the most relevant for prediction. Such methods are
particularly  important  when  the  true  underlying  model  has  a  sparse  representation 
(many  parameters  close  to  zero).  The  identification  of  relevant  variables  will  reduce 
the  noise  and  therefore  improve  the  prediction  performance  of  the  fitted  model. 

Some popular regularisation methods are ridge regression, subset
selection, $L_1$ norm penalisation and their modifications and combinations. Ridge
regression, for instance, which minimises a penalised residual sum of squares using
the squared $L_2$ norm penalty, is employed to improve the OLS estimate through
a bias-variance trade-off. However, ridge regression has the drawback that it cannot
yield a parsimonious model since it keeps all predictors in the model and therefore
creates an interpretability problem. It also gives prediction errors close to those from
the OLS model.

Another approach proposed for variable selection is the so-called “least absolute
shrinkage and selection operator” (Lasso), which aims at combining the features of ridge
regression and subset selection by either retaining (and shrinking) the coefficients or
setting them to zero. This method has received several extensions such as the Elastic
net, a combination of Lasso and ridge regression, or the Group Lasso, used when
predictors are divided into groups. This chapter describes the application of Lasso,
Group Lasso as well as the Elastic net in the linear regression model with a continuous
response variable and in the logit model with a binary response variable.
9.1 Lasso

Tibshirani (1996) first introduced Lasso for generalised linear models, where the
response variable $y$ is continuous rather than categorical. Lasso has two important
characteristics. First, it has an $L_1$-penalty term which performs shrinkage on
coefficients in a way similar to ridge regression, where an $L_2$ penalty is used.

Second, unlike ridge regression, Lasso performs variable subset selection, driving
some coefficients to exactly zero due to the nature of the constraint, where the
objective function may touch the square constraint area at a corner. For this
reason, the Lasso is able to produce sparse solutions and is therefore able to combine
good features of both ridge regression and the subset selection procedure. It yields
interpretable models and has the stability of ridge regression.


9.1.1  Lasso  in  the  Linear  Regression  Model 


The linear regression model can be written as follows:

$$y = \mathcal{X}\beta + \varepsilon,$$

where $y$ is an $(n \times 1)$ vector of observations for the response variable,
$\mathcal{X} = (x_1^{\top}, \ldots, x_n^{\top})^{\top}$, $x_i \in \mathbb{R}^p$, $i = 1, \ldots, n$, is a data matrix of $p$ explanatory variables,
and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^{\top}$ is a vector of errors where $E(\varepsilon_i) = 0$ and $Var(\varepsilon_i) = \sigma^2$,
$i = 1, \ldots, n$.

In this framework, $E(y | \mathcal{X}) = \mathcal{X}\beta$ with $\beta = (\beta_1, \ldots, \beta_p)^{\top}$. Further assume that
the columns of $\mathcal{X}$ are standardised such that $n^{-1} \sum_{i=1}^{n} x_{ij} = 0$ and $n^{-1} \sum_{i=1}^{n} x_{ij}^2 = 1$.
The Lasso estimate $\hat\beta$ can then be defined as follows

$$\hat\beta = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^{\top}\beta)^2, \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s, \qquad (9.1)$$

where $s \ge 0$ is the tuning parameter which controls the amount of shrinkage. For
the OLS estimate $\hat\beta^0 = (\mathcal{X}^{\top}\mathcal{X})^{-1}\mathcal{X}^{\top}y$ a choice of tuning parameter $s < s_0$, where
$s_0 = \sum_{j=1}^{p} |\hat\beta_j^0|$, will cause shrinkage of the solutions towards 0, and ultimately
some coefficients may be exactly equal to 0. For values $s \ge s_0$ the Lasso coefficients
are equal to the unpenalised OLS coefficients.

An alternative representation of (9.1) is:

$$\hat\beta = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} (y_i - x_i^{\top}\beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}, \qquad (9.2)$$

with a tuning parameter $\lambda \ge 0$. As $\lambda$ increases, the Lasso estimates are continuously
shrunk toward zero. Then if $\lambda$ is quite large, some coefficients are exactly zero. For
$\lambda = 0$ the Lasso coefficients coincide with the OLS estimate. In fact, if the solution
to (9.1) is denoted as $\hat\beta_s$ and the solution to (9.2) as $\hat\beta_\lambda$, then for every $\lambda > 0$ and the
resulting solution $\hat\beta_\lambda$ there exists $s_\lambda$ such that $\hat\beta_\lambda = \hat\beta_{s_\lambda}$ and vice versa, which implies a
one-to-one correspondence between these parameters. However, this does not hold if it is
required that $\lambda \ge 0$ only and not $\lambda > 0$, because if, for instance, $\lambda = 0$, then $\hat\beta_\lambda$ is
the same for any $s \ge \|\hat\beta^0\|_1$ and the correspondence is no longer one-to-one.


Geometrical Aspects in $\mathbb{R}^2$

The Lasso estimate under the least squares loss function solves a quadratic
programming problem with linear inequality constraints. The criterion $\sum_{i=1}^{n} (y_i - x_i^{\top}\beta)^2$
yields, up to an additive constant, the quadratic form objective function

$$(\beta - \hat\beta^0)^{\top} W (\beta - \hat\beta^0) \qquad (9.3)$$

with $W = \mathcal{X}^{\top}\mathcal{X}$. For the special case when $p = 2$, $\beta = (\beta_1, \beta_2)^{\top}$, the
resulting elliptical contour lines are centred around the OLS estimate and the linear
constraints are represented by a square (shaded area) shown in Fig. 9.1. The Lasso
solution is the first place that the contours touch the square, and this sometimes
occurs at a corner, corresponding to a zero coefficient. The nature of the Lasso
shrinkage may not appear completely obvious. In the work by Efron, Hastie,
Johnstone, and Tibshirani (2004) the Least Angle Regression (LAR) algorithm
with a Lasso modification was described which computes the whole path of Lasso
solutions and gives a better understanding of the shrinkage nature.

Fig. 9.1 Lasso in the general design case for $s = 4$ and OLS estimate $\hat\beta^0 = (6, 7)^{\top}$


The  LAR  Algorithm  and  Lasso  Solution  Paths 

The LAR algorithm may be introduced in the simple three-dimensional case as
follows (assume that the number of covariates $p = 3$):

• first, standardise all the covariates to have mean 0 and unit length as well as make
the response variable have mean zero;

• start with $\hat\beta = 0$;

• initialise the algorithm with the first two covariates: let $\mathcal{X} = (x_1, x_2)$ and
calculate the prediction vector $\hat y_0 = \mathcal{X}\hat\beta = 0$;

• calculate $\bar y_2$, the projection of $y$ onto $\mathcal{L}(x_1, x_2)$, the linear space spanned by $x_1$
and $x_2$;

• compute the vector of current correlations between the covariates $\mathcal{X}$ and the
two-dimensional current residual vector: $\hat C_{\hat y_0} = \mathcal{X}^{\top}(\bar y_2 - \hat y_0) = (\hat c_1^{\hat y_0}, \hat c_2^{\hat y_0})^{\top}$.
According to Fig. 9.2, the current residual $\bar y_2 - \hat y_0$ makes a smaller angle with
$x_1$ than with $x_2$, therefore $\hat c_1^{\hat y_0} > \hat c_2^{\hat y_0}$;

• augment $\hat y_0$ in the direction of $x_1$ so that $\hat y_1 = \hat y_0 + \hat\gamma_1 x_1$ with $\hat\gamma_1$ chosen such
that $\hat c_1^{\hat y_1} = \hat c_2^{\hat y_1}$, which means that the new current residual $\bar y_2 - \hat y_1$ makes equal
angles (is equiangular) with $x_1$ and $x_2$;

• suppose that another regressor $x_3$ enters the model: calculate a new projection $\bar y_3$
of $y$ onto $\mathcal{L}(x_1, x_2, x_3)$;

• recompute the current correlations vector $\hat C_{\hat y_1} = (\hat c_1^{\hat y_1}, \hat c_2^{\hat y_1}, \hat c_3^{\hat y_1})^{\top}$ with
$\mathcal{X} = (x_1, x_2, x_3)$, $\bar y_3$ and $\hat y_1$;

• augment $\hat y_1$ in the equiangular direction so that $\hat y_2 = \hat y_1 + \hat\gamma_2 u_2$ with $\hat\gamma_2$
chosen such that $\hat c_1^{\hat y_2} = \hat c_2^{\hat y_2} = \hat c_3^{\hat y_2}$, then the new current residual $\bar y_3 - \hat y_2$ goes
equiangularly between $x_1$, $x_2$ and $x_3$ (here $u_2$ is the unit vector lying along the
equiangular direction);

• the three-dimensional algorithm is terminated with the calculation of the final
prediction vector $\hat y_3 = \hat y_2 + \hat\gamma_3 u_3$ with $\hat\gamma_3$ chosen such that $\hat y_3 = \bar y_3$.

In the case of $p > 3$ covariates, $\hat\gamma_3$ would be smaller than the step $\bar\gamma_3$ needed to reach
the projection $\bar y_3$, initiating another change of direction, as illustrated in Fig. 9.2.

Fig. 9.2 Illustration of LARS algorithm

In  this  setup,  it  is  important  that  the  covariate  vectors  x\,  x2,  x3  are  linearly 
independent.  The  LAR  algorithm  “moves”  the  variable  coefficients  to  their  least 
squares  values.  So  the  Lasso  adjustment  necessary  for  the  sparse  solution  is  that 
if  a  nonzero  coefficient  happens  to  return  to  zero,  it  should  be  dropped  from  the 
current  (“active”)  set  of  variables  and  not  be  considered  in  further  computations. 
The  general  LAR  algorithm  for  p  predictors  can  be  summarised  as  follows. 


Least  Angle  Regression  Algorithm 

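1. The covariates are standardised to have mean 0 and unit length 1 and the
response has mean 0:

$$\sum_{i=1}^{n} y_i = 0, \quad \sum_{i=1}^{n} x_{ij} = 0, \quad \sum_{i=1}^{n} x_{ij}^2 = 1, \quad j = 1, 2, \ldots, p.$$

The task is to construct the fit $\hat\beta = (\hat\beta_1, \ldots, \hat\beta_p)^{\top}$ by iteratively changing
the prediction vector $\hat y = \sum_{j=1}^{p} x_j \hat\beta_j = \mathcal{X}\hat\beta$.

2. Denote by $\mathcal{A}$ a subset of the indices $\{1, 2, \ldots, p\}$, begin with $\hat y_{\mathcal{A}} = \hat y_0 = 0$
and calculate the vector of current correlations

$$\hat c = \mathcal{X}^{\top}(y - \hat y_{\mathcal{A}}).$$

3. Then define the current active set $\mathcal{A} = \{j : |\hat c_j| = \hat C\}$ as the set of
indices corresponding to the covariates with the greatest absolute current
correlations, where $\hat C = \max_j \{|\hat c_j|\}$; let $s_j = sign\{\hat c_j\}$ for $j \in \mathcal{A}$ and
compute the matrix $\mathcal{X}_{\mathcal{A}} = (s_j x_j)_{j \in \mathcal{A}}$, the scalar $A_{\mathcal{A}} = (1_{\mathcal{A}}^{\top} G_{\mathcal{A}}^{-1} 1_{\mathcal{A}})^{-1/2}$ with
$G_{\mathcal{A}} = \mathcal{X}_{\mathcal{A}}^{\top}\mathcal{X}_{\mathcal{A}}$ and $1_{\mathcal{A}}$ being a vector of ones of length $|\mathcal{A}|$, and the so-called
equiangular vector $u_{\mathcal{A}} = \mathcal{X}_{\mathcal{A}} w_{\mathcal{A}}$ with $w_{\mathcal{A}} = A_{\mathcal{A}} G_{\mathcal{A}}^{-1} 1_{\mathcal{A}}$, which makes equal
angles, each less than 90°, with the columns of $\mathcal{X}_{\mathcal{A}}$.

4. Calculate the inner product vector $a = \mathcal{X}^{\top} u_{\mathcal{A}}$ and the step length in the direction $u_{\mathcal{A}}$

$$\hat\gamma = \min^{+}_{j \in \mathcal{A}^c} \left\{ \frac{\hat C - \hat c_j}{A_{\mathcal{A}} - a_j}, \; \frac{\hat C + \hat c_j}{A_{\mathcal{A}} + a_j} \right\},$$

where $\min^{+}$ indicates that the minimum is taken over positive components only.

5. Define $\hat d$ to be the $p$-vector equaling $s_j w_{\mathcal{A} j}$ for $j \in \mathcal{A}$ and zero elsewhere
and $\gamma_j = -\hat\beta_j / \hat d_j$, yielding $\tilde\gamma = \min_{\gamma_j > 0} \{\gamma_j\}$.

(a) If $\tilde\gamma < \hat\gamma$, calculate the next LARS step as

$$\hat y_{\mathcal{A}_+} = \hat y_{\mathcal{A}} + \tilde\gamma u_{\mathcal{A}}$$

with $\mathcal{A}_+ = \mathcal{A} - \{\tilde j\}$, where $\tilde j$ is the index attaining $\tilde\gamma$.

(b) Else: calculate the next step as

$$\hat y_{\mathcal{A}_+} = \hat y_{\mathcal{A}} + \hat\gamma u_{\mathcal{A}}.$$

6. Iterate until all $p$ predictors have been entered, some of which are ultimately
dropped from the active set $\mathcal{A}$.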
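The equiangular vector of step 3 is the least familiar object in the algorithm, so a small numerical check may help. The sketch below is only an illustration of that single step, not a full LAR implementation: the helper names `solve`, `dot` and `equiangular` are hypothetical, the two covariates are toy unit-length vectors (not centred, for brevity), and the linear system is solved by plain Gaussian elimination.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def solve(G, b):
    """Solve G z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(G)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z

def equiangular(cols, signs):
    """cols: active covariates (lists of length n, unit length);
    signs: s_j in {+1, -1}. Returns (A_A, w_A, u_A) as defined in step 3."""
    XA = [[s * v for v in col] for col, s in zip(cols, signs)]
    G = [[dot(ci, cj) for cj in XA] for ci in XA]      # G_A = X_A' X_A
    g = solve(G, [1.0] * len(XA))                      # G_A^{-1} 1_A
    A = 1.0 / math.sqrt(sum(g))                        # (1' G^{-1} 1)^{-1/2}
    w = [A * gi for gi in g]                           # w_A = A_A G_A^{-1} 1_A
    n = len(cols[0])
    u = [sum(w[j] * XA[j][i] for j in range(len(XA))) for i in range(n)]
    return A, w, u

# Toy active set with two unit-length covariates.
x1 = [1.0, 0.0, 0.0]
x2 = [0.6, 0.8, 0.0]
A_A, w_A, u_A = equiangular([x1, x2], [1, 1])
print(A_A, dot(x1, u_A), dot(x2, u_A), dot(u_A, u_A))
```

The printed inner products of $u_{\mathcal{A}}$ with both covariates coincide with $A_{\mathcal{A}}$, and $u_{\mathcal{A}}$ has unit length, which is exactly the equal-angle property used in steps 4 and 5.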


This  algorithm  can  be  implemented  on  a  grid  from  0  to  1  of  the  standardised 
coefficients  constraint  s  resulting  in  the  complete  paths  of  the  Lasso  coefficients 
and  illustrating  the  nature  of  Lasso  shrinkage. 

Once the Lasso solution paths have been obtained, it is important to decide on a
rule for choosing the “optimal” solution, or, equally, the regularisation parameter
$\lambda$. There are several existing methods to do this and the most popular examples
are the $K$-fold cross-validation, generalised cross-validation and Schwarz’s (Bayesian)
Information Criterion (BIC). All these methods can be viewed as degrees-of-freedom
adjustments to the residual squared error (RSE), which underestimates the
true prediction error

$$\text{RSE} \stackrel{def}{=} \sum_{i=1}^{n} (y_i - \hat y_i)^2.$$

Consider the generalised cross-validation statistic:

$$\text{GCV}(\lambda) = n^{-1} \text{RSE}_\lambda / \{1 - df(\lambda)/n\}^2, \qquad (9.4)$$

where $\text{RSE}_\lambda$ is the residual sum of squares for the constrained fit with a particular
regularisation parameter $\lambda$. An alternative is the BIC

$$\text{BIC} = n \log(\hat\sigma^2) + \log(n) \cdot df(\lambda) \qquad (9.5)$$

with the estimate of the error variance $\hat\sigma^2 = n^{-1} \sum_{i=1}^{n} (y_i - \hat y_i)^2$.
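Both criteria are simple functions of the residual sum of squares, the degrees of freedom and the sample size, so comparing candidate values of the regularisation parameter is straightforward. The following sketch assumes hypothetical triples of regularisation parameter, residual sum of squares and degrees of freedom; the numbers are invented for illustration and do not come from any data set in this book.

```python
import math

def gcv(rse, df, n):
    """Generalised cross-validation statistic (9.4)."""
    return (rse / n) / (1.0 - df / n) ** 2

def bic(rse, df, n):
    """BIC (9.5), with the error variance estimated by rse / n."""
    return n * math.log(rse / n) + math.log(n) * df

n = 74
# Hypothetical candidates: (lambda, RSE_lambda, df(lambda)).
candidates = [(0.0, 40.0, 12.0), (0.5, 42.0, 7.5), (2.0, 55.0, 3.0)]

best_gcv = min(candidates, key=lambda c: gcv(c[1], c[2], n))
best_bic = min(candidates, key=lambda c: bic(c[1], c[2], n))

for lam, rse, df in candidates:
    print(lam, round(gcv(rse, df, n), 4), round(bic(rse, df, n), 4))
print("chosen by GCV:", best_gcv[0], "chosen by BIC:", best_bic[0])
```

In this toy comparison a small increase in the residual sum of squares is outweighed by the reduction in degrees of freedom, so both criteria prefer the intermediate fit.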
The degrees of freedom of the predicted vector $\hat y$ in the Lasso problem with the
linear Gaussian model with normally distributed errors having zero expectation and
variance $\sigma^2$, written $\varepsilon_i \sim N(0, \sigma^2)$, can be defined as follows:

$$df(\lambda) = \sigma^{-2} \sum_{i=1}^{n} Cov(\hat y_i, y_i), \qquad (9.6)$$

which can actually be used for both linear and non-linear models. This expression
for $df(\lambda)$ can be viewed as a quantitative measure of how the prediction error bias
depends on how much each $y_i$ affects its fitted value $\hat y_i$. The estimate $\hat\beta$
minimising the GCV statistic can then be chosen. The following example shows
how to compute $df(\lambda)$.

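Example 9.1 (Calculation of $df(\lambda)$) As no closed-form solution exists for the Lasso
problem, an approximation should be calculated. The constraint $\sum_{j=1}^{p} |\beta_j| \le s$ can
be rewritten as $\sum_{j=1}^{p} \beta_j^2 / |\beta_j| \le s$. Using the duality between the constrained and
unconstrained problems and one-to-one correspondence between $s$ and $\lambda$, the Lasso
solution is computed as the ridge regression estimate

$$\hat\beta = (\mathcal{X}^{\top}\mathcal{X} + \lambda B^{-1})^{-1}\mathcal{X}^{\top}y,$$

where $B = diag(|\hat\beta_j|)$. Then it follows that

$$\hat y = \mathcal{X}\hat\beta = \mathcal{X}(\mathcal{X}^{\top}\mathcal{X} + \lambda B^{-1})^{-1}\mathcal{X}^{\top}y.$$

Then, to calculate $Cov(\hat y_i, y_i)$, one could use $Cov(\hat y_i, y_i) = Cov(e_i^{\top}\hat y, e_i^{\top}y) =
e_i^{\top} Cov(\hat y, y) e_i$, where $e_i$ is a vector whose $i$th entry is 1 and the rest are zero.
Furthermore, each entry in the sum of (9.6) can be calculated to be

$$Cov(\hat y_i, y_i) = e_i^{\top} Cov(\hat y, y) e_i \qquad (9.7)$$

$$= e_i^{\top} \mathcal{X}(\mathcal{X}^{\top}\mathcal{X} + \lambda B^{-1})^{-1}\mathcal{X}^{\top} Cov(y, y) e_i \qquad (9.8)$$

$$= \sigma^2 (\mathcal{X}^{\top}e_i)^{\top} (\mathcal{X}^{\top}\mathcal{X} + \lambda B^{-1})^{-1} (\mathcal{X}^{\top}e_i) \qquad (9.9)$$

$$= \sigma^2 x_i^{\top} (\mathcal{X}^{\top}\mathcal{X} + \lambda B^{-1})^{-1} x_i. \qquad (9.10)$$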


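Using the fact that (9.10) are scalars for all $i$ as well as the properties of the trace of
a matrix and matrix multiplication rules mentioned in Chap. 2, one obtains the final
closed-form expression for the effective degrees of freedom in the Lasso problem:

$$df(\lambda) = \frac{1}{\sigma^2} \sum_{i=1}^{n} tr\left\{ \sigma^2 x_i^{\top} (\mathcal{X}^{\top}\mathcal{X} + \lambda B^{-1})^{-1} x_i \right\}$$

$$= \sum_{i=1}^{n} tr\left\{ x_i x_i^{\top} (\mathcal{X}^{\top}\mathcal{X} + \lambda B^{-1})^{-1} \right\}$$

$$= tr\left\{ \left( \sum_{i=1}^{n} x_i x_i^{\top} \right) (\mathcal{X}^{\top}\mathcal{X} + \lambda B^{-1})^{-1} \right\}$$

$$= tr\left\{ \mathcal{X}^{\top}\mathcal{X} (\mathcal{X}^{\top}\mathcal{X} + \lambda B^{-1})^{-1} \right\}$$

$$= tr\left\{ \mathcal{X} (\mathcal{X}^{\top}\mathcal{X} + \lambda B^{-1})^{-1} \mathcal{X}^{\top} \right\}.$$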

It  should  be  noted  that  the  formula  for  the  effective  degrees  of  freedom  derived 
above  is  valid  in  the  case  of  the  underlying  model  with  non-random  regressors. 
When  the  random  design  is  used  and  the  set  of  nonzero  predictors  is  not  fixed, 
another  estimator  should  be  used. 
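For a quick feel of how the effective degrees of freedom behave, consider the special case in which the cross-product matrix of the regressors is the identity. This is an assumption of the sketch, not a general result: the trace formula then reduces to a sum of the terms $|\hat\beta_j| / (|\hat\beta_j| + \lambda)$, and coefficients that are exactly zero are treated as contributing nothing (the limiting value). The function name `lasso_df_orthonormal` is hypothetical.

```python
def lasso_df_orthonormal(beta_hat, lam):
    """Effective degrees of freedom tr{(I + lam * B^{-1})^{-1}} with
    B = diag(|beta_hat_j|), valid under the assumption X'X = I.
    Zero coefficients contribute 0 (limit of |b| / (|b| + lam) as b -> 0)."""
    total = 0.0
    for b in beta_hat:
        if b != 0.0:
            total += abs(b) / (abs(b) + lam)
    return total

# Hypothetical coefficient vector with one coefficient set to zero.
beta_hat = [1.5, 2.5, 0.0]
df_no_penalty = lasso_df_orthonormal(beta_hat, 0.0)   # counts nonzero coefficients
df_penalised = lasso_df_orthonormal(beta_hat, 1.0)    # strictly smaller
print(df_no_penalty, df_penalised)
```

Without a penalty the value equals the number of nonzero coefficients, and it decreases smoothly as $\lambda$ grows.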


Orthonormal  Design  Case 


A computationally convenient special case is the so-called orthonormal design
framework. In the orthonormal design case $\mathcal{X}^{\top}\mathcal{X}$ is a diagonal matrix, namely the
identity matrix: $\mathcal{X}^{\top}\mathcal{X} = \mathcal{I}$. Here the explicit Lasso estimate is

$$\hat\beta_j = sign(\hat\beta_j^0)(|\hat\beta_j^0| - \gamma)^{+}, \qquad (9.11)$$

subject to

$$\sum_{j=1}^{p} |\hat\beta_j| = s. \qquad (9.12)$$

The formula shows what was already mentioned in the beginning, namely that the
Lasso estimate is a compromise between subset selection and ridge regression: the
estimate is either shrunk by $\gamma$ or is set to zero. As a consequence Lasso coefficients
can take values between zero and $\hat\beta_j^0$.

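Example 9.2 (Orthonormal Design Case for $p = 2$) Let $\hat\beta^0 = (\hat\beta_1^0, \hat\beta_2^0)^{\top}$
w.l.o.g. be in the first quadrant, i.e. $\hat\beta_1^0 > 0$ and $\hat\beta_2^0 > 0$. This gives us
the first condition. The orthonormal design ensures that the elliptical contour
lines describe circles around the OLS estimate. Thus we get a linear function
going through the point $\hat\beta^0$ and being orthogonal (if possible) to the first
condition. Equalising both conditions

$$\hat\beta_1 + \hat\beta_2 = s \qquad (9.13)$$

$$\hat\beta_2 = \hat\beta_1 + (\hat\beta_2^0 - \hat\beta_1^0) \qquad (9.14)$$

the Lasso estimate can now be accurately determined:

$$\hat\beta_1 = \left( \frac{s}{2} + \frac{\hat\beta_1^0 - \hat\beta_2^0}{2} \right)^{+} \qquad (9.15)$$

$$\hat\beta_2 = \left( \frac{s}{2} - \frac{\hat\beta_1^0 - \hat\beta_2^0}{2} \right)^{+} \qquad (9.16)$$

For cases in which $\frac{s}{2} + \frac{\hat\beta_1^0 - \hat\beta_2^0}{2} < 0$ or $\frac{s}{2} - \frac{\hat\beta_1^0 - \hat\beta_2^0}{2} < 0$ the corresponding Lasso
estimates will always be zero as the position of $\hat\beta^0$ and the corresponding contour
lines do not make it possible to get the orthogonality condition mentioned above.
Let $\hat\beta^0 = (6, 7)^{\top}$ and tuning parameter $s = 4$. In this case the Lasso estimator is
given by, as shown in Fig. 9.3:

$$\hat\beta_1 = \left( \frac{4}{2} + \frac{6 - 7}{2} \right)^{+} = 1.5, \qquad (9.17)$$

$$\hat\beta_2 = \left( \frac{4}{2} - \frac{6 - 7}{2} \right)^{+} = 2.5. \qquad (9.18)$$

Fig. 9.3 Lasso in the orthonormal design case for $s = 4$ and OLS estimate $\hat\beta^0 = (6, 7)^{\top}$ Q
MVAlassocontour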
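The numbers of this example can be reproduced by applying the soft-thresholding form of the estimate and searching for the threshold that makes the absolute coefficients sum to the tuning parameter. The sketch below assumes the orthonormal design case and uses bisection for the threshold; the function names `soft_threshold` and `lasso_orthonormal` are illustrative choices, not part of any library.

```python
def soft_threshold(b0, gamma):
    """sign(b0) * (|b0| - gamma)^+ for a single OLS coefficient b0."""
    mag = max(abs(b0) - gamma, 0.0)
    return mag if b0 >= 0 else -mag

def lasso_orthonormal(beta_ols, s, tol=1e-12):
    """Orthonormal design only: find gamma >= 0 by bisection such that the
    soft-thresholded coefficients satisfy sum |beta_j| = s."""
    if sum(abs(b) for b in beta_ols) <= s:
        return list(beta_ols), 0.0        # constraint not binding: OLS
    lo, hi = 0.0, max(abs(b) for b in beta_ols)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sum(abs(soft_threshold(b, mid)) for b in beta_ols) > s:
            lo = mid                      # too little shrinkage
        else:
            hi = mid                      # enough (or too much) shrinkage
    gamma = 0.5 * (lo + hi)
    return [soft_threshold(b, gamma) for b in beta_ols], gamma

beta, gamma = lasso_orthonormal([6.0, 7.0], 4.0)
print(beta, gamma)

# A tighter constraint drives the smaller coefficient to exactly zero.
beta_tight, gamma_tight = lasso_orthonormal([6.0, 7.0], 0.5)
print(beta_tight, gamma_tight)
```

For $s = 4$ both coefficients are shrunk by the same amount, while for a sufficiently small $s$ the first coefficient is set to zero, which is the selection behaviour described above.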
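In terms of $\lambda$, the Lasso solution (9.11) in the orthonormal design case can be
calculated in a usual unconstrained minimisation problem. Note that in this case the
least squares solution is given by

$$\hat\beta^0 = (\mathcal{X}^{\top}\mathcal{X})^{-1}\mathcal{X}^{\top}y = \mathcal{X}^{\top}y.$$

Then the minimisation problem is written as

$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^p} \|y - \mathcal{X}\beta\|_2^2 + \lambda\|\beta\|_1$$

$$= \arg\min_{\beta \in \mathbb{R}^p} (y - \mathcal{X}\beta)^{\top}(y - \mathcal{X}\beta) + \lambda \sum_{j=1}^{p} |\beta_j|$$

$$= \arg\min_{\beta \in \mathbb{R}^p} -2y^{\top}\mathcal{X}\beta + \beta^{\top}\beta + \lambda \sum_{j=1}^{p} |\beta_j|$$

$$= \arg\min_{\beta \in \mathbb{R}^p} -2\hat\beta^{0\top}\beta + \beta^{\top}\beta + \lambda \sum_{j=1}^{p} |\beta_j|$$

$$= \arg\min_{\beta \in \mathbb{R}^p} \sum_{j=1}^{p} \left( -2\hat\beta_j^0 \beta_j + \beta_j^2 + \lambda|\beta_j| \right).$$

The objective function can now be minimised by separate minimisation of its $j$th
element. To solve

$$\min_{\beta} \left( -2\hat\beta^0 \beta + \beta^2 + \lambda|\beta| \right), \qquad (9.19)$$

where the index $j$ was dropped for simplicity, let us first assume that $\hat\beta^0 > 0$,
then $\beta \ge 0$, because otherwise a lower value for the objective function could be obtained by
changing the sign of $\beta$. Then the solution for the modified problem

$$\min_{\beta} \left( -2\hat\beta^0 \beta + \beta^2 + \lambda\beta \right) \qquad (9.20)$$

is, obviously, $\hat\beta = \hat\beta^0 - \gamma$, where $\gamma = \lambda/2$, as in (9.11). To ensure the sign
consistency for this case, one could see that the solution is

$$\hat\beta = (\hat\beta^0 - \gamma)^{+} = sign(\hat\beta^0)(|\hat\beta^0| - \gamma)^{+}. \qquad (9.21)$$

Now let us take $\hat\beta^0 < 0$, then $\beta \le 0$ as well and the solution for the new problem

$$\min_{\beta} \left( -2\hat\beta^0 \beta + \beta^2 - \lambda\beta \right) \qquad (9.22)$$

is $\hat\beta = \hat\beta^0 + \gamma$, but the sign consistency requires that

$$\hat\beta = (\hat\beta^0 + \gamma)^{-}$$

$$= -(-\hat\beta^0 - \gamma)^{+}$$

$$= sign(\hat\beta^0)(|\hat\beta^0| - \gamma)^{+}.$$

As the solutions are the same in both cases, the expression $sign(\hat\beta^0)(|\hat\beta^0| - \gamma)^{+}$ is
indeed the solution to the original Lasso problem.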
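The closed-form minimiser derived above can be verified numerically by brute force on a fine grid. This is only a sanity check under the orthonormal design assumption, with the threshold taken as half the penalty parameter as in the derivation; the function names and the grid settings are arbitrary choices made for this sketch.

```python
def objective(beta, b0, lam):
    """One-dimensional Lasso criterion -2*b0*beta + beta^2 + lam*|beta|."""
    return -2.0 * b0 * beta + beta * beta + lam * abs(beta)

def soft(b0, lam):
    """Closed-form minimiser sign(b0) * (|b0| - lam/2)^+."""
    mag = max(abs(b0) - lam / 2.0, 0.0)
    return mag if b0 >= 0 else -mag

def grid_argmin(b0, lam, lo=-10.0, hi=10.0, steps=20001):
    """Brute-force minimiser of the criterion on an equally spaced grid."""
    best_beta, best_val = lo, objective(lo, b0, lam)
    for i in range(steps):
        beta = lo + (hi - lo) * i / (steps - 1)
        val = objective(beta, b0, lam)
        if val < best_val:
            best_beta, best_val = beta, val
    return best_beta

# Positive, negative and "small" OLS coefficients (the last is set to zero).
cases = [(3.0, 2.0), (-3.0, 2.0), (0.5, 2.0)]
results = [(soft(b0, lam), grid_argmin(b0, lam)) for b0, lam in cases]
print(results)
```

In all three cases the grid minimiser agrees with the soft-thresholding formula up to the grid resolution.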


General  Lasso  Solution 

For a fixed $s \ge 0$ the Lasso estimation problem is a least squares problem
subject to $2^p$ linear inequality constraints, as there are $2^p$ different possible signs
for $\beta = (\beta_1, \ldots, \beta_p)^{\top}$. Lawson and Hansen (1974) suggested solving the least
squares problem subject to a general linear inequality constraint $G\beta \le h$ where
$G$ $(m \times p)$ corresponds to the $m = 2^p$ constraints and $h = s 1_m$. As $m$ could be
very large, this procedure is not very fast computationally. Therefore Lawson and
Hansen (1974) introduced the inequality constraints sequentially in their algorithm,
seeking a feasible solution.

Let $g(\beta) = \sum_{i=1}^{n} (y_i - x_i^{\top}\beta)^2$ and let $\delta_k$, $k = 1, \ldots, 2^p$, be column vectors of
$p$-tuples of the form $(\pm 1, \ldots, \pm 1)$. It follows that the linear inequality condition
can be equivalently described as $\delta_k^{\top}\beta \le s$, $k = 1, \ldots, 2^p$. Now let $E = \{k \,|\, \delta_k^{\top}\beta = s\}$
be the equality set, $m_E$ the number of elements of $E$ and $G_E = (\delta_k^{\top})_{k \in E}$ a matrix
whose rows are all $\delta_k^{\top}$ for $k \in E$. Now the algorithm works as follows, see
Tibshirani (1996):

1. Find the OLS estimate $\hat\beta^0$ and let $\delta_{k_0} = sign(\hat\beta^0)$, $E = \{k_0\}$.

2. Find $\hat\beta$ to minimise $g(\beta)$ subject to $G_E \beta \le s 1_{m_E}$.

3. If $\sum_{j=1}^{p} |\hat\beta_j| \le s$ the computation is complete.

4. If $\sum_{j=1}^{p} |\hat\beta_j| > s$ add $k$ to the set $E$ where $\delta_k = sign(\hat\beta)$ and go back to step 2.

5. The final iterate is a solution to the original problem.

As the number of steps is limited by $m = 2^p$, the algorithm has to converge
in finite time. The average number of iterations in practice is between $0.5p$ and
$0.75p$.
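The equivalence between the single absolute-value constraint and the family of sign-vector constraints used in this algorithm can be checked by enumeration for small dimensions. The sketch below only illustrates that equivalence and the choice of the constraint added in step 4; it does not solve the constrained least squares problem, and the function names are hypothetical.

```python
from itertools import product

def l1_norm(beta):
    return sum(abs(b) for b in beta)

def max_signed_sum(beta):
    """Maximum of delta' beta over all 2^p sign vectors delta in {+1, -1}^p."""
    return max(sum(d * b for d, b in zip(delta, beta))
               for delta in product((1, -1), repeat=len(beta)))

def violated_constraint(beta):
    """Sign vector delta = sign(beta), i.e. the constraint that step 4 would
    add to the equality set (zeros are mapped to +1 by convention here)."""
    return tuple(1 if b >= 0 else -1 for b in beta)

beta = [1.5, -2.0, 0.25]
delta = violated_constraint(beta)
signed = sum(d * b for d, b in zip(delta, beta))
print(l1_norm(beta), max_signed_sum(beta), delta, signed)
```

Since the largest signed sum equals the sum of absolute values, requiring all $2^p$ linear inequalities is the same as bounding the $L_1$ norm, and the sign vector of the current iterate always identifies the most violated inequality.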

Example 9.3 Let us consider the car data set (Table 22.3) where $n = 74$. We
want to study in what way the price ($X_1$) depends on the 12 other variables
($X_2$), ..., ($X_{13}$), which are represented by $j = 1, 2, \ldots, 12$, using Lasso regression.
In Fig. 9.4 one can clearly see that coefficients become nonzero one at a
time, that means the variables enter the regression equation sequentially as the
scaled shrinkage parameter $\hat s = s / \|\hat\beta^0\|_1$ increases, in order $j = 6, 11, 9, 3, \ldots$
(representing $X_7, X_{12}, X_{10}, X_4, \ldots$), hence the $L_1$ penalty results in variable selection
and the variables which are most relevant are shrunk less. In this example,
an optimal $s$ can be found such that the fitted model gives the smallest residual
(see Exercise 9.3).

Fig. 9.4 Lasso estimates of standardised regression $\hat\beta_j$ for car data with $n = 74$ and $p = 12$ Q
MVAlassoregress


9.1.2  Lasso  in  High  Dimensions 


The problem with the algorithm by Tibshirani to calculate the Lasso solutions is that
it is initialised from an OLS solution of the unconstrained problem which does not
correspond to the true model. Another problem is that for the case of $p > n$, this
computation is infeasible. Therefore it may be optimal to start with a small initial
guess for $\beta$ and iterate through a different kind of algorithm to obtain the Lasso
solutions. Such an algorithm is based on the properties of the Lasso problem as
a convex programming one. Osborne et al. (2000) showed that the original Lasso
estimate problem (9.1) can be rewritten as:

$$\hat\beta = \arg\min_{\beta} \frac{1}{2}(y - \mathcal{X}\beta)^{\top}(y - \mathcal{X}\beta) \stackrel{def}{=} \frac{1}{2} r^{\top}r, \quad \text{subject to} \quad s - \|\beta\|_1 \ge 0, \qquad (9.23)$$

where $r = (y - \mathcal{X}\beta)$. Let $J = \{i_1, \ldots, i_p\}$ be the set of indices such that
$|(\mathcal{X}^{\top}r)_{i_j}| = \|\mathcal{X}^{\top}r\|_\infty$, for $j = 1, \ldots, p$; so indices in $J$ correspond to nonzero
elements of $\beta$. Also let $P$ be the permutation matrix that permutes the elements
of the coefficient vector $\beta$ so that the first elements are the nonzero elements:
$\beta = P^{\top}(\beta_J^{\top}, 0^{\top})^{\top}$. Denote by $\theta_J = sign(\beta_J)$ the vector whose elements are equal to 1
if the corresponding element of $\beta_J$ is positive and $-1$ otherwise. Further denoting
$f(\beta) = (y - \mathcal{X}\beta)^{\top}(y - \mathcal{X}\beta)$, the following optimisation algorithm is based on the local
linearisation of (9.1) around $\beta$:

$$\hat h = \arg\min_{h} f(\beta + h), \quad \text{subject to} \quad \theta_J^{\top}(\beta_J + h_J) \le s \quad \text{and} \quad h = P^{\top}(h_J^{\top}, 0^{\top})^{\top}, \qquad (9.24)$$

the solution for which can be shown to be equal to

$$h_J = (\mathcal{X}_J^{\top}\mathcal{X}_J)^{-1}\{\mathcal{X}_J^{\top}(y - \mathcal{X}_J\beta_J) - \mu\theta_J\},$$


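where

$$\mu = \max\left\{ 0, \; \frac{\theta_J^{\top}(\mathcal{X}_J^{\top}\mathcal{X}_J)^{-1}\mathcal{X}_J^{\top}y - s}{\theta_J^{\top}(\mathcal{X}_J^{\top}\mathcal{X}_J)^{-1}\theta_J} \right\}.$$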
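With a single active covariate the step can be worked out by hand, which makes the role of the multiplier easy to see. The sketch below is an illustration under stated assumptions: the multiplier is taken as the Lagrange-multiplier value that makes the constraint hold with equality when it binds and zero otherwise (an assumption of this sketch), the function name `single_variable_step` is hypothetical, and the data are toy numbers.

```python
def single_variable_step(x, y, beta, theta, s):
    """Step h for one active covariate x with current coefficient beta and
    sign theta in {+1, -1}: h = (x'x)^{-1} {x'(y - x*beta) - mu*theta}.
    mu is chosen so that theta*(beta + h) <= s holds (mu = 0 if slack)."""
    xx = sum(v * v for v in x)
    xy = sum(a * b for a, b in zip(x, y))
    mu = max(0.0, (theta * xy / xx - s) / (theta * theta / xx))
    h = (xy - xx * beta - mu * theta) / xx
    return h, mu

# Toy data: x has unit length and the OLS coefficient of y on x equals 9.
x = [1.0 / 3.0, 2.0 / 3.0, 2.0 / 3.0]
y = [3.0, 6.0, 6.0]

h_tight, mu_tight = single_variable_step(x, y, 0.0, 1, 4.0)    # constraint binds
h_loose, mu_loose = single_variable_step(x, y, 0.0, 1, 20.0)   # constraint slack
print(h_tight, mu_tight, h_loose, mu_loose)
```

When the bound is tight the updated coefficient lands exactly on the bound, and when the bound is slack the multiplier vanishes and the step reproduces the least squares coefficient.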


The  procedure  as  a  whole  is  implemented  as  shown  in  the  “Lasso  solution-path 
optimisation”  algorithm.  As  shown  in  the  algorithm,  indices  may  enter  and  leave  the 
set  J,  which  makes  the  Lasso  problem  similar  to  other  subset  selection  techniques. 
Moreover,  one  can  compute  the  whole  path  of  Lasso  solutions  for  0  <  s  <  so,  each 
time  taking  the  solution  for  the  previous  s  as  a  starting  point  for  the  next  one. 


9.1.3  Lasso  in  Logit  Model 


The Lasso model can be extended to generalised linear models, one of the most
common of which is the logistic regression (logit) model. Coefficients in the logit
model have a probabilistic interpretation. In the logit model, the linear predictor $X\beta$
is related to the conditional mean $\mu$ of the response variable $y$ via the logit link
$\log\{\mu/(1 - \mu)\}$. As the response variable is binary, it is binomially distributed and
$\mu = p(x_i)$. Therefore, as defined in (9.25), the logit model for the $(n \times 1)$ vector
$y \in \{0, 1\}^n$ of observations on a binary response variable and covariates
$x_i = (x_{i1}, \ldots, x_{ip})^\top$ is


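$$\log\left\{\frac{p(x_i)}{1 - p(x_i)}\right\} = \sum_{j=1}^p \beta_j x_{ij},$$

where

$$p(x_i) = P(y_i = 1 \mid x_i) = \frac{\exp\left(\sum_{j=1}^p \beta_j x_{ij}\right)}{1 + \exp\left(\sum_{j=1}^p \beta_j x_{ij}\right)}. \tag{9.25}$$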
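The relation between the linear predictor and the success probability in (9.25) can be checked numerically. The following is a minimal sketch; the function names and the covariate and coefficient values are hypothetical and serve only as an illustration.

```python
import math

def logit_probability(x, beta):
    """p(x) = exp(eta) / (1 + exp(eta)) with eta = sum_j beta_j * x_j, as in (9.25)."""
    eta = sum(b * v for b, v in zip(beta, x))
    # Evaluate the logistic function in a numerically stable way.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

def log_odds(p):
    """Logit link log{p / (1 - p)}."""
    return math.log(p / (1.0 - p))

x_i = [1.0, -2.0, 0.5]    # hypothetical covariate vector
beta = [0.3, 0.1, -0.4]   # hypothetical coefficients
p_i = logit_probability(x_i, beta)
# Applying the logit link to p_i recovers the linear predictor (here about -0.1).
print(p_i, log_odds(p_i))
```

Applying `log_odds` to the fitted probability returns the linear predictor, which is why each coefficient can be read as a change in the log odds.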
Algorithm Lasso solution-path optimisation

procedure FIND(optimal $\hat\beta$)
    Choose initial $\beta$ and $J$ (e.g. $\beta \leftarrow 0$, $J \leftarrow \emptyset$)
    repeat
        Solve (9.24) to obtain $h$
        Set $\tilde\beta \leftarrow \beta + h$
        if $\operatorname{sign}(\tilde\beta_J) = \theta_J$ then
            Obtain the solution $\hat\beta = \tilde\beta$
        else
            repeat
                Find the smallest $\gamma$, $0 < \gamma < 1$, $k \in J$ such that $0 = \beta_k + \gamma h_k$
                Set $\beta = \beta + \gamma h$
                Set $\theta_k = -\theta_k$
                Solve (9.24) again to obtain a new $h$
                if $\theta_J^\top(\beta_J + h_J) \le s$ then
                    $\tilde\beta = \beta + h$
                else
                    Update $J \leftarrow J - k$
                    Recompute $\beta_J$, $\theta_J$, $h$
                end if
            until $\operatorname{sign}(\tilde\beta_J) = \theta_J$
        end if
        Compute $v \leftarrow X^\top \tilde r / \|X_J^\top \tilde r\|_\infty = P^\top (v_1, v_2)^\top$, here $\tilde r = y - X\tilde\beta$
        if $-1 \le (v_2)_i \le 1$ for $1 \le i \le p - |J|$ then
            $\tilde\beta$ is a solution
        else
            Find $j$ such that $|(v_2)_j|$ is maximised
            Update $J \leftarrow (J, j)$
            Update $\tilde\beta_J \leftarrow (\tilde\beta_J, 0)^\top$
            Update $\theta_J \leftarrow (\theta_J, \operatorname{sign}(v_2)_j)^\top$
        end if
        Set $\beta \leftarrow \tilde\beta$
    until $-1 \le (v_2)_i \le 1$ for $1 \le i \le p - |J|$
end procedure


The  Lasso  estimate  for  the  logit  model  is  obtained  by  solving  the  following 
optimisation  problem: 


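$$\hat\beta = \arg\min_\beta \left\{\sum_{i=1}^n g(-y_i x_i^\top \beta)\right\}, \quad \text{subject to } \sum_{j=1}^p |\beta_j| \le s, \tag{9.26}$$

with tuning parameter $s \ge 0$ and log-loss function $g(u) = \log\{1 + \exp(u)\}$. An
alternative representation of the Lasso estimate $\hat\beta$ in the logit model is:

$$\hat\beta = \arg\min_\beta \left\{\sum_{i=1}^n g(-y_i x_i^\top \beta) + \lambda \sum_{j=1}^p |\beta_j|\right\}. \tag{9.27}$$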
P 
Shevade and Keerthi (2003) developed a simple and efficient algorithm to solve
the optimisation in (9.27), based on the Gauss-Seidel method with a coordinate-wise
descent approach. The algorithm is asymptotically convergent and easy to
implement. Firstly, define the following terms,

$$u_i = -y_i x_i^\top \beta, \qquad F_j = \sum_{i=1}^n \frac{\exp(u_i)}{1 + \exp(u_i)}\, y_i x_{ij}. \tag{9.28}$$


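The first order optimality conditions for (9.27) are:

$$\begin{aligned}
F_j &= 0 && \text{if } j = 0,\\
F_j &= \lambda && \text{if } \beta_j > 0,\ j > 0,\\
F_j &= -\lambda && \text{if } \beta_j < 0,\ j > 0,\\
-\lambda \le F_j &\le \lambda && \text{if } \beta_j = 0,\ j > 0.
\end{aligned}$$

A new variable is defined

$$v_j = \begin{cases}
|F_j| & \text{if } j = 0,\\
|\lambda - F_j| & \text{if } \beta_j > 0,\ j > 0,\\
|\lambda + F_j| & \text{if } \beta_j < 0,\ j > 0,\\
\psi_j & \text{if } \beta_j = 0,\ j > 0,
\end{cases}$$

where $\psi_j = \max\{(F_j - \lambda), (-\lambda - F_j), 0\}$. Thus, the first-order optimality conditions
can be written as

$$v_j = 0 \quad \forall j. \tag{9.29}$$

It is difficult to satisfy the optimality conditions exactly, so the stopping criterion for (9.27)
is defined as follows (for some small $\varepsilon$),

$$v_j \le \varepsilon \quad \forall j. \tag{9.30}$$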

To write the algorithm, let us define $I_z = \{j : \beta_j = 0,\ j > 0\}$ and
$I_{nz} = \{j : \beta_j \ne 0,\ j > 0\}$ for the sets of zero estimates and nonzero estimates,
respectively, and $I = I_z \cup I_{nz}$. Let $W$ denote the objective function in (9.27). The
algorithm consists of two loops. The first loop runs over the variables in $I_z$ to choose
the maximum violator, $v$. In the second loop $W$ is optimised with respect to $\beta_v$, so that
the set $I_{nz}$ is modified, and the maximum violator in $I_{nz}$ is obtained. The second loop
is repeated until no violators are found in $I_{nz}$. The algorithm alternates between the
first and second loop until no violators exist in both $I_z$ and $I_{nz}$.
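The violation measure $v_j$ and the search for the maximum violator can be sketched as follows. This is an illustration under stated assumptions, not the implementation of Shevade and Keerthi: the function names are hypothetical, the values $F_j$ are taken as given, and index 0 is treated as the intercept.

```python
def violation(F_j, beta_j, lam, is_intercept=False):
    """Violation v_j of the first-order optimality conditions for coefficient j."""
    if is_intercept:
        return abs(F_j)
    if beta_j > 0:
        return abs(lam - F_j)
    if beta_j < 0:
        return abs(lam + F_j)
    # beta_j == 0: optimal whenever -lam <= F_j <= lam, so psi_j is zero there.
    return max(F_j - lam, -lam - F_j, 0.0)

def max_violator(F, beta, lam, indices, eps=1e-6):
    """Return (j, v_j) for the largest violation above eps among indices, else None."""
    worst = None
    for j in indices:
        v = violation(F[j], beta[j], lam, is_intercept=(j == 0))
        if v > eps and (worst is None or v > worst[1]):
            worst = (j, v)
    return worst

# Hypothetical values: intercept at index 0, three zero coefficients in I_z.
F = [0.0, 1.5, -0.2, -2.0]
beta = [0.1, 0.0, 0.0, 0.0]
print(max_violator(F, beta, lam=1.0, indices=[1, 2, 3]))
```

The tolerance `eps` plays the role of $\varepsilon$ in the stopping criterion (9.30).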
Algorithm Lasso in logit model

procedure FIND(optimal Lasso estimate $\hat\beta$)
    Set $\beta_j = 0$ for all $j$
    while an optimality violator exists in $I_z$ do
        Find the maximum violator ($v$) in $I_z$
        repeat
            Optimise $W$ with respect to $\beta_v$
            Find the maximum violator ($v$) in $I_{nz}$
        until no violator exists in $I_{nz}$
    end while
end procedure


Another way to obtain the Lasso estimate in the logit model is by maximising the
likelihood function of the logit model with the Lasso constraint. The log-likelihood function
of the logit model is written as

$$\log L(\beta) = \sum_{i=1}^n \left[y_i \log p(x_i) + (1 - y_i)\log\{1 - p(x_i)\}\right]. \tag{9.31}$$


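Write $\ell(\beta) = \log L(\beta) = \sum_{i=1}^n \ell_i(\beta)$, where $\ell_i(\beta)$ denotes the $i$th summand
in (9.31) and $\beta = (\beta_1, \ldots, \beta_p)^\top$. The Lasso estimates are
obtained by maximising the penalised log-likelihood for the logit model as follows

$$\hat\beta = \arg\max_\beta \left\{n^{-1}\sum_{i=1}^n \ell_i(\beta)\right\}, \quad \text{subject to } \sum_{j=1}^p |\beta_j| \le s. \tag{9.32}$$

It can be solved by a general non-linear programming procedure or by using iteratively
reweighted least squares (IRLS). Friedman, Hastie, and Tibshirani (2010) developed
an algorithm to solve the problem in (9.32). An alternative representation of the
Lasso problem is defined as follows:

$$\hat\beta = \arg\max_\beta \left\{n^{-1}\sum_{i=1}^n \ell_i(\beta) - \lambda\sum_{j=1}^p |\beta_j|\right\}. \tag{9.33}$$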


Example 9.4 Following Example 9.3, the price ($X_1$) in the car data set (Table 22.3) has
an average of 6,192.28. We now define a new categorical variable which takes the value
0 if $X_1 < 6{,}000$ and is equal to 1 otherwise. We want to study in what way this binary
price variable depends on the 12 other variables ($X_2, \ldots, X_{13}$) using the Lasso in the logit model.

In Fig. 9.5 one can see that the coefficients' dynamics depend on the shrinkage
parameter $s = \|\hat\beta(\lambda)\|_1$, the $L_1$ norm of the estimated coefficients. An optimal $s$ can be
chosen such that the fitted model gives the smallest residual (see Exercise 9.4).

Fig. 9.5 Lasso estimates $\hat\beta_j$ of logit model for car data with $n = 74$ and $p = 12$ Q MVAlassologit


9.2  Elastic  Net 

Although  the  Lasso  is  widely  used  in  variable  selection,  it  has  several  drawbacks. 
Zou  and  Hastie  (2005)  stated  that: 

1.  if  p  >  n ,  the  Lasso  selects  at  most  n  variables  before  it  saturates; 

2.  if  there  is  a  group  of  variables  which  has  very  high  correlation,  then  the  Lasso 
tends  to  select  only  one  variable  from  this  group; 

3.  for  usual  n  >  p  condition,  if  there  are  high  correlations  between  predictors, 
the  prediction  performance  of  the  Lasso  is  dominated  by  ridge  regression,  see 
Tibshirani  (1996). 

Zou and Hastie (2005) introduced the Elastic net which combines good features
of the $L_1$-norm and $L_2$-norm penalties. The Elastic net is a regularised regression
method which overcomes the limitations of the Lasso. This method is very useful
when $p \gg n$ or there are many correlated variables. The advantages are: (1) a group
of correlated variables can be selected without arbitrary omissions, (2) the number
of selected variables is no longer limited by the sample size.

9.2.1 Elastic Net in Linear Regression Model

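We describe the Elastic net in the linear regression model. For simplicity we
assume that the $x_{ij}$ are standardised such that $\sum_{i=1}^n x_{ij} = 0$ and $n^{-1}\sum_{i=1}^n x_{ij}^2 = 1$.
The Elastic net penalty $P_\alpha(\beta)$ leads to the following modification of the problem to
obtain the estimator $\hat\beta$:

$$\hat\beta = \arg\min_\beta \left\{(2n)^{-1}\sum_{i=1}^n (y_i - x_i^\top\beta)^2 + \lambda P_\alpha(\beta)\right\}, \tag{9.34}$$

where

$$P_\alpha(\beta) = (1 - \alpha)\tfrac{1}{2}\|\beta\|_2^2 + \alpha\|\beta\|_1 = \sum_{j=1}^p \left\{\tfrac{1}{2}(1 - \alpha)\beta_j^2 + \alpha|\beta_j|\right\}. \tag{9.35}$$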


The penalty $P_\alpha(\beta)$ is a compromise between ridge regression and the Lasso. If
$\alpha = 0$ then the criterion is the ridge regression and if $\alpha = 1$ the method will be the
Lasso. Practically, for small $\varepsilon > 0$, the Elastic net with $\alpha = 1 - \varepsilon$ performs like the
Lasso, but removes degeneracies and erratic variable selection behaviour caused by
extreme correlation. Given a specific $\lambda$, as $\alpha$ increases from 0 to 1, the sparsity of
the Elastic net solutions increases monotonically from 0 to the sparsity of the Lasso
solutions.

The  Elastic  net  optimisation  problem  can  be  represented  as  the  usual  Lasso 
problem,  using  modified  X  and  y  vectors,  as  shown  in  the  following  example. 

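Example 9.5 To turn the Elastic net optimisation problem into the usual Lasso
one, one should first augment $y$ with $p$ additional zeros to obtain $\tilde y = (y, 0)^\top$.
Then, augment $X$ with the multiple $\sqrt{\lambda\alpha}\, I$ of the $p \times p$ identity matrix to get
$\tilde X = (X^\top, \sqrt{\lambda\alpha}\, I)^\top$. Next, define $\tilde\lambda = \lambda(1 - \alpha)$ and solve the original Lasso
minimisation problem in terms of the new input $\tilde y$, $\tilde X$ and $\tilde\lambda$. This new problem is
equivalent to the original Elastic net problem:

$$\begin{aligned}
\|\tilde y - \tilde X\beta\|_2^2 + \tilde\lambda\|\beta\|_1
&= \left\|\begin{pmatrix} y \\ 0 \end{pmatrix} - \begin{pmatrix} X\beta \\ \sqrt{\lambda\alpha}\,\beta \end{pmatrix}\right\|_2^2 + \lambda(1 - \alpha)\|\beta\|_1\\
&= \|y - X\beta\|_2^2 + \lambda\alpha\|\beta\|_2^2 + \lambda\|\beta\|_1 - \lambda\alpha\|\beta\|_1\\
&= \|y - X\beta\|_2^2 + \lambda\{\alpha\|\beta\|_2^2 + (1 - \alpha)\|\beta\|_1\}.
\end{aligned}$$

Note that in this display $\alpha$ weights the $L_2$ term, whereas in the convention used above
$\alpha = 1$ yields the Lasso.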
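The equivalence in Example 9.5 can be verified numerically by evaluating both criteria at the same coefficient vector. The sketch below uses plain Python lists; all function names and data values are hypothetical, and $\alpha$ weights the ridge term as in the example.

```python
import math

def rss(X, y, beta):
    """Residual sum of squares ||y - X beta||_2^2."""
    return sum((yi - sum(xij * bj for xij, bj in zip(row, beta))) ** 2
               for row, yi in zip(X, y))

def lasso_objective(X, y, beta, lam):
    return rss(X, y, beta) + lam * sum(abs(b) for b in beta)

def elastic_net_objective(X, y, beta, lam, alpha):
    """Criterion of Example 9.5: alpha weights the L2 term, (1 - alpha) the L1 term."""
    l2 = sum(b * b for b in beta)
    l1 = sum(abs(b) for b in beta)
    return rss(X, y, beta) + lam * (alpha * l2 + (1.0 - alpha) * l1)

def augment(X, y, lam, alpha):
    """Append sqrt(lam * alpha) * I to X and p zeros to y; return the new Lasso penalty."""
    p = len(X[0])
    c = math.sqrt(lam * alpha)
    X_aug = [list(row) for row in X]
    X_aug += [[c if i == j else 0.0 for j in range(p)] for i in range(p)]
    y_aug = list(y) + [0.0] * p
    return X_aug, y_aug, lam * (1.0 - alpha)

# Hypothetical data and coefficient vector.
X = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]]
y = [1.0, -0.5, 2.0]
beta = [0.7, -0.2]
lam, alpha = 0.8, 0.25

X_aug, y_aug, lam_aug = augment(X, y, lam, alpha)
lhs = lasso_objective(X_aug, y_aug, beta, lam_aug)
rhs = elastic_net_objective(X, y, beta, lam, alpha)
print(lhs, rhs)  # the two values agree up to rounding
```

Because the two criteria coincide for every $\beta$, any Lasso solver applied to the augmented data returns an Elastic net solution.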

We follow the idea of Friedman et al. (2010) who used a coordinate descent
algorithm to solve the optimisation problem in (9.34). Suppose we have
estimates $\tilde\beta_k$ for $k \ne j$. Then we optimise (9.34) partially with respect to $\beta_j$ by
computing the gradient at $\beta_j = \tilde\beta_j$, which only exists if $\tilde\beta_j \ne 0$. Having the
soft-thresholding operator $S(z, \gamma)$ as


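$$S(z, \gamma) = \operatorname{sign}(z)(|z| - \gamma)_+ = \begin{cases}
z - \gamma & \text{if } z > 0 \text{ and } \gamma < |z|,\\
z + \gamma & \text{if } z < 0 \text{ and } \gamma < |z|,\\
0 & \text{if } \gamma \ge |z|,
\end{cases} \tag{9.36}$$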


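it can be shown that the coordinate-wise update has the following form

$$\tilde\beta_j \leftarrow \frac{S\left\{n^{-1}\sum_{i=1}^n x_{ij}(y_i - \tilde y_i^{(j)}),\ \lambda\alpha\right\}}{1 + \lambda(1 - \alpha)}, \tag{9.37}$$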


where $\tilde y_i^{(j)} = \sum_{k \ne j} x_{ik}\tilde\beta_k$ is a fitted value which excludes the contribution $x_{ij}\tilde\beta_j$,
therefore $y_i - \tilde y_i^{(j)}$ is the partial residual for fitting $\beta_j$.

The algorithm computes the least squares estimate for the partial residual
$y_i - \tilde y_i^{(j)}$, then applies the soft-thresholding rule to account for the Lasso contribution to
the penalty $P_\alpha(\beta)$. Afterwards, a proportional shrinkage is applied for the ridge penalty.

There are several methods used to update the current estimate $\tilde\beta$. We describe the
simplest updating method, the so-called naive update.
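The coordinate-wise update (9.37) can be sketched in a few lines. The code below is a minimal illustration, not the implementation of Friedman et al. (2010): it assumes standardised columns (mean 0 and $n^{-1}\sum_i x_{ij}^2 = 1$), recomputes the partial residual from scratch for clarity, and runs a fixed number of sweeps; the function names and the small data set are hypothetical.

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator S(z, gamma) = sign(z) * (|z| - gamma)_+."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def elastic_net_cd(X, y, lam, alpha, n_sweeps=100):
    """Cyclic coordinate descent for the Elastic net criterion (9.34)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0
            for i in range(n):
                # Fitted value without the contribution of x_ij * beta_j.
                fit_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_without_j)
            rho /= n
            # Lasso part: soft-thresholding; ridge part: proportional shrinkage.
            beta[j] = soft_threshold(rho, lam * alpha) / (1.0 + lam * (1.0 - alpha))
    return beta

# Hypothetical standardised, orthogonal design with y = 2 * x1 + 0.5 * x2.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.5, 1.5, -1.5, -2.5]
print(elastic_net_cd(X, y, lam=0.0, alpha=1.0))  # no penalty: least squares fit
print(elastic_net_cd(X, y, lam=1.0, alpha=1.0))  # Lasso: small coefficient set to zero
print(elastic_net_cd(X, y, lam=1.0, alpha=0.5))  # Elastic net: thresholding plus shrinkage
```

In this orthogonal example a single sweep already reaches the fixed point; with correlated columns several sweeps are needed, and a convergence check would normally replace the fixed sweep count.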

The partial residual can be rewritten as follows:

$$y_i - \tilde y_i^{(j)} = y_i - \hat y_i + x_{ij}\tilde\beta_j = r_i + x_{ij}\tilde\beta_j, \tag{9.38}$$

with $\hat y_i$ being the current fit and $r_i$ the current residual. As $x_j$ is standardised,

$$n^{-1}\sum_{i=1}^n x_{ij}(y_i - \tilde y_i^{(j)}) = n^{-1}\sum_{i=1}^n x_{ij} r_i + \tilde\beta_j. \tag{9.39}$$

Note that the first term on the right-hand side of (9.39) is, up to sign, the
gradient of the loss with respect to $\beta_j$.

9.2.2 Elastic Net in Logit Model

The Elastic net penalty can similarly be applied to the logit model. Recall the
log-likelihood function of the logit model in (9.31),

$$\log L(\beta) = \sum_{i=1}^n \left[y_i \log p(x_i) + (1 - y_i)\log\{1 - p(x_i)\}\right].$$
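The penalised log-likelihood for the logit model using the Elastic net has the following form

$$\max_\beta \left\{n^{-1}\sum_{i=1}^n \ell_i(\beta) - \lambda P_\alpha(\beta)\right\}, \tag{9.40}$$

where $\ell_i(\beta) = y_i \log p(x_i) + (1 - y_i)\log\{1 - p(x_i)\}$ is the $i$th summand of the
log-likelihood, so that $\sum_{i=1}^n \ell_i(\beta) = \log L(\beta)$. The solution of (9.40) can be found by means of a
Newton algorithm. For a fixed $\lambda$ and given a current parameter $\tilde\beta$, the quadratic
approximation (Taylor expansion) is updated about the current estimates $\tilde\beta$ as follows:

$$\ell_Q(\beta) = -(2n)^{-1}\sum_{i=1}^n w_i (z_i - x_i^\top\beta)^2 + C(\tilde\beta)^2, \tag{9.41}$$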

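where the working response and weight, respectively, are:

$$z_i = x_i^\top\tilde\beta + \frac{y_i - \tilde p(x_i)}{\tilde p(x_i)\{1 - \tilde p(x_i)\}},$$

$$w_i = \tilde p(x_i)\{1 - \tilde p(x_i)\},$$

with $\tilde p(x_i)$ the probability (9.25) evaluated at $\tilde\beta$. A Newton update is obtained by
minimising $-\ell_Q(\beta)$.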

Friedman et al. (2010) proposed a similar approach, creating an outer loop for
each value of $\lambda$, which computes the quadratic approximation in (9.41) about the current
estimates $\tilde\beta$. Afterwards, a coordinate descent algorithm is used to solve the
following penalised weighted least squares problem (PWLS)

$$\min_\beta \{-\ell_Q(\beta) + \lambda P_\alpha(\beta)\}. \tag{9.42}$$

This inner coordinate descent loop continues until the maximum change in (9.42) is
less than a very small threshold.
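The quantities entering the quadratic approximation (9.41) are straightforward to compute. The following sketch evaluates the working responses $z_i$ and weights $w_i$ at a current estimate; the function name and the data are hypothetical, and no safeguard is included for fitted probabilities very close to 0 or 1, where the weights become numerically unstable.

```python
import math

def working_response_and_weights(X, y, beta):
    """Return (z, w) of (9.41) for rows x_i in X, binary y and current estimate beta."""
    z, w = [], []
    for row, yi in zip(X, y):
        eta = sum(xij * bj for xij, bj in zip(row, beta))  # linear predictor
        p = 1.0 / (1.0 + math.exp(-eta))                   # fitted probability
        wi = p * (1.0 - p)                                 # weight w_i
        z.append(eta + (yi - p) / wi)                      # working response z_i
        w.append(wi)
    return z, w

# Hypothetical data; at beta = 0 every fitted probability equals 0.5.
X = [[1.0, 0.5], [1.0, -0.5]]
y = [1, 0]
z, w = working_response_and_weights(X, y, [0.0, 0.0])
print(z, w)
```

Each outer iteration would recompute `z` and `w` and pass them to a weighted version of the coordinate descent update.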


9.3  Group  Lasso 

The Group Lasso was first introduced by Yuan and Lin (2006) and was motivated
by the fact that the predictor variables can occur in several groups and one could
want a parsimonious model which uses only a few of these groups. That is, assume
that there are $K$ groups and the vector of coefficients is structured as follows

$$\beta_G = (\beta_1^\top, \ldots, \beta_K^\top)^\top \in \mathbb{R}^{\sum_k p_k},$$

where $p_k$ is the dimension of the coefficient vector $\beta_k$ of the $k$th group, $k = 1, \ldots, K$. A
sparse set of groups is produced: within each group, either all entries of $\beta_k$,
$k = 1, \ldots, K$, the corresponding subvector of the whole vector $\beta_G$, are zero or all of
them are nonzero. The Group Lasso problem can be formulated in general as


$$\hat\beta_G = \arg\min_{\beta \in \mathbb{R}^{\sum_k p_k}} n^{-1}\left\|y - \sum_{k=1}^K X_k\beta_k\right\|_2^2 + \lambda\sum_{k=1}^K \sqrt{p_k}\,\|\beta_k\|_2, \tag{9.43}$$

where $X_k$ is the $k$th component of the matrix $X$ with columns corresponding to
the predictors in the group $k$, $\beta_k$ is the coefficient vector for that group and $p_k$ is
the cardinality of the group, i.e. the size of the coefficient vector, which serves as
a balancing weight in the case of widely differing group sizes. It is obvious that if
groups consist of single elements, i.e. $p_k = 1\ \forall k$, then the Group Lasso problem is
reduced to the usual Lasso one.

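The computation of the Group Lasso solution involves calculating the necessary
and sufficient subgradient KKT conditions for $\hat\beta_G = (\hat\beta_1^\top, \ldots, \hat\beta_K^\top)^\top$ to be a
solution for (9.43), where constant factors of the squared-error term are absorbed into $\lambda$:

$$-X_k^\top\left(y - \sum_{l=1}^K X_l\hat\beta_l\right) + \frac{\lambda\sqrt{p_k}\,\hat\beta_k}{\|\hat\beta_k\|} = 0, \tag{9.44}$$

if $\hat\beta_k \ne 0$; otherwise, for $\hat\beta_k = 0$, it holds that

$$\left\|X_k^\top\left(y - \sum_{l \ne k} X_l\hat\beta_l\right)\right\| \le \lambda\sqrt{p_k}. \tag{9.45}$$

Expressions (9.44) and (9.45) allow us to calculate the solution, the so-called update
step, which can be used to implement an iterative algorithm to solve the
problem (9.43). The solution resulting from the KKT conditions is readily shown to
be the following:

$$\hat\beta_k = \left(\lambda\sqrt{p_k}\,\|\hat\beta_k\|^{-1} I + X_k^\top X_k\right)^{-1} X_k^\top \hat r_k, \tag{9.46}$$

where the residual $\hat r_k$ is defined as $\hat r_k \stackrel{\mathrm{def}}{=} y - \sum_{l \ne k} X_l\hat\beta_l$. As a special (orthonormal)
case, when $X_k^\top X_k = I$, the solution is simplified to
$\hat\beta_k = (\lambda\sqrt{p_k}\,\|\hat\beta_k\|^{-1} + 1)^{-1} X_k^\top \hat r_k$. To obtain a full solution to this problem,
Yuan and Lin (2006) suggest using a blockwise coordinate descent algorithm which
iteratively applies the estimate (9.46) to $k = 1, \ldots, K$.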
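In the orthonormal case the implicit update can be solved in closed form. Writing $s_k = X_k^\top \hat r_k$ and taking norms on both sides of the simplified update gives $\|\hat\beta_k\| = \|s_k\| - \lambda\sqrt{p_k}$ whenever this is positive, so that $\hat\beta_k = (1 - \lambda\sqrt{p_k}/\|s_k\|)_+\, s_k$. The sketch below implements this group-wise soft-thresholding; the function name and input values are hypothetical.

```python
import math

def group_update_orthonormal(s, lam):
    """Group Lasso update for one group when X_k' X_k = I; s stands for X_k' r_k."""
    p_k = len(s)
    norm_s = math.sqrt(sum(v * v for v in s))
    threshold = lam * math.sqrt(p_k)
    if norm_s <= threshold:
        # Condition (9.45) holds: the whole group is set to zero.
        return [0.0] * p_k
    scale = 1.0 - threshold / norm_s
    return [scale * v for v in s]

s = [3.0, 4.0]  # hypothetical value of X_k' r_k, with norm 5
beta_k = group_update_orthonormal(s, lam=1.0)
print(beta_k)
print(group_update_orthonormal([0.5, 0.5], lam=1.0))  # small group is removed entirely
```

The whole group is either kept, with all coefficients shrunk by the same factor, or removed, which is the between-group sparsity described above.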

Meier, van de Geer, and Bühlmann (2008) extended the Group Lasso to the case
of logistic regression and demonstrated convergence of several algorithms for the
computation of the solution as well as outlined consistency results for the Group
Lasso logit estimator. The general setup for that model involves a binary response
variable $y_i \in \{0, 1\}$ and $K$ groups of predictor variables $x_i = (x_{i1}^\top, \ldots, x_{iK}^\top)^\top$; both
$x_i$ and $y_i$ are i.i.d., $i = 1, \ldots, n$. Then the logistic linear regression model may be
written as before:

$$\log\left\{\frac{p(x_i)}{1 - p(x_i)}\right\} = \eta(x_i) = \beta_0 + \sum_{k=1}^K x_{ik}^\top\beta_k, \tag{9.47}$$

where the conditional probability $p(x_i) = P(y_i = 1 \mid x_i)$. The Group Lasso logit
estimator $\hat\beta$ then minimises the objective function

$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^{p+1}} \left\{-\ell(\beta) + \lambda\sum_{k=1}^K \sqrt{p_k}\,\|\beta_k\|_2\right\}, \tag{9.48}$$

where $\ell(\cdot)$ is the log-likelihood function

$$\ell(\beta) = \sum_{i=1}^n \left(y_i\,\eta(x_i) - \log[1 + \exp\{\eta(x_i)\}]\right).$$

The problem is solved through a group-wise minimisation of the penalised objective
function by, for example, the block-coordinate descent method.

Example  9.6  The  Group  Lasso  results  can  be  illustrated  by  an  application  to  the 
MEMset Donor dataset of human donor splice sites with a sequence length of 7
base pairs. The full dataset (training and test parts) consists of 12,623 true ($y_i = 1$)
and 269,155 false ($y_i = 0$) human donor sites. Each element of data represents a
sequence  of  DNA  within  a  window  of  the  splice  site  which  consists  of  the  last  three 
positions  of  the  exon  and  first  4  positions  of  the  intron;  so  the  strings  of  length  7 
are  made  up  of  4  characters  A,  C,  T,  G  and  therefore  the  predictor  variables  are  7 
factors,  each  having  4  levels.  False  splice  sites  are  sequences  on  the  DNA  which 
match  the  consensus  sequence  at  position  four  and  five.  Figure  9.6  shows  how  the 
Group  Lasso  does  shrinkage  on  the  level  of  groups  built  by  DNA  letters. 

As  is  seen  from  Example  9.6,  the  solution  to  the  Group  Lasso  problem  yields  a 
sparse  solution  only  regarding  the  “between”  case,  that  is,  it  excludes  some  of  the 
groups  from  the  model  but  then  all  coefficients  in  the  remaining  groups  are  nonzero. 
To ensure sparsity both between groups and within each group, Simon, Friedman,
Hastie, and Tibshirani (2013) proposed the so-called “sparse Group Lasso” which
uses a more general penalty which yields sparsity at both the inter- and intragroup level.
The sparse Group Lasso estimate solves the problem


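$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^p} \left\|y - \sum_{k=1}^K X_k\beta_k\right\|_2^2 + \lambda_1\sum_{k=1}^K \|\beta_k\|_2 + \lambda_2\|\beta\|_1, \tag{9.49}$$

where $\beta = (\beta_1, \beta_2, \ldots, \beta_K)^\top$ is the entire parameter vector.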
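The two penalty terms in (9.49) act at different levels: the group term depends on each $\beta_k$ only through its Euclidean norm, while the $L_1$ term acts on individual coefficients. A small sketch with a hypothetical function name and hypothetical values makes this explicit.

```python
import math

def sparse_group_penalty(groups, lam1, lam2):
    """Penalty of (9.49): lam1 * sum_k ||beta_k||_2 + lam2 * ||beta||_1."""
    group_part = sum(math.sqrt(sum(b * b for b in g)) for g in groups)
    l1_part = sum(abs(b) for g in groups for b in g)
    return lam1 * group_part + lam2 * l1_part

# Two hypothetical groups; the second is entirely zero and contributes nothing.
groups = [[3.0, 4.0], [0.0, 0.0]]
print(sparse_group_penalty(groups, lam1=1.0, lam2=0.5))  # 1.0 * 5 + 0.5 * 7
```

Setting `lam2 = 0` leaves only a group-level penalty, and setting `lam1 = 0` leaves the usual Lasso penalty.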
[Figure: coefficient paths plotted against Lambda]

Fig. 9.6 Lasso estimates of standardised regression $\beta_j$ for car data with $n = 74$ and $p = 12$ Q MVAgrouplasso


Summary

Lasso gives a sparse solution. The Lasso estimate combines the best of both
ridge regression and subset regression.

If there is a group of variables which has very high correlation, then
the Lasso tends to select only one variable from the group.

The LARS algorithm computes the whole path of Lasso solutions
and is feasible for the high-dimensional case $p \gg n$.

Elastic net combines good features of $L_1$-norm and $L_2$-norm
penalties.

The Elastic net is very useful when $p \gg n$ or there are many
correlated variables.

The Sparse Group Lasso can perform shrinkage both on inter- and
intragroup level.
9.4 Exercises


Exercise  9.1  Derive  the  explicit  Lasso  estimate  in  (9.11)  for  the  orthonormal 
design  case. 


Exercise 9.2 Compare the Lasso orthonormal design case for $p = 2$ graphically to
ridge regression, i.e. to the problem $\hat\beta = \arg\min_\beta \left\{\sum_{i=1}^n (y_i - x_i^\top\beta)^2\right\}$ subject to
$\sum_{j=1}^p \beta_j^2 \le s$.

Why does the Lasso produce variable selection and ridge regression does not?


Exercise  9.3  Optimise  the  value  of  s  such  that  the  fitted  model  in  Example  9.3 
produces  the  smallest  residual. 


Exercise  9.4  Optimise  the  value  of  s  such  that  the  fitted  model  in  Example  9.4 
produces  the  smallest  residual. 


Chapter  10 

Decomposition  of  Data  Matrices  by  Factors 


In  Chap.  1  basic  descriptive  techniques  were  developed  which  provided  tools 
for  “looking”  at  multivariate  data.  They  were  based  on  adaptations  of  bivariate 
or  univariate  devices  used  to  reduce  the  dimensions  of  the  observations.  In  the 
following  three  chapters,  issues  of  reducing  the  dimension  of  a  multivariate  data 
set  will  be  discussed.  The  perspectives  will  be  different  but  the  tools  will  be  related. 

In  this  chapter,  we  take  a  descriptive  perspective  and  show  how  using  a 
geometrical  approach  provides  a  “best”  way  of  reducing  the  dimension  of  a  data 
matrix. It is derived with respect to a least-squares criterion. The result will be low
dimensional  graphical  pictures  of  the  data  matrix.  This  involves  the  decomposition 
of  the  data  matrix  into  “factors”.  These  “factors”  will  be  sorted  in  decreasing 
order  of  importance.  The  approach  is  very  general  and  is  the  core  idea  of  many 
multivariate  techniques.  We  deliberately  use  the  word  “factor”  here  as  a  tool  or 
transformation  for  structural  interpretation  in  an  exploratory  analysis.  In  practice, 
the  matrix  to  be  decomposed  will  be  some  transformation  of  the  original  data 
matrix  and  as  shown  in  the  following  chapters,  these  transformations  provide  easier 
interpretations  of  the  obtained  graphs  in  lower  dimensional  spaces. 

Chapter 11 addresses the issue of reducing the dimensionality of a multivariate
random  variable  by  using  linear  combinations  (the  principal  components).  The 
identified  principal  components  are  ordered  in  decreasing  order  of  importance. 
When  applied  in  practice  to  a  data  matrix,  the  principal  components  will  turn  out  to 
be  the  factors  of  a  transformed  data  matrix  (the  data  will  be  centred  and  eventually 
standardised). 

Factor  analysis  is  discussed  in  Chap.  12.  The  same  problem  of  reducing  the 
dimension  of  a  multivariate  random  variable  is  addressed  but  in  this  case  the 
number  of  factors  is  fixed  from  the  start.  Each  factor  is  interpreted  as  a  latent 
characteristic of the individuals revealed by the original variables. The
non-uniqueness of the solutions is dealt with by searching for the representation with
the  easiest  interpretation  for  the  analysis. 

Summarising,  this  chapter  can  be  seen  as  a  foundation  since  it  develops  a  basic 
tool  for  reducing  the  dimension  of  a  multivariate  data  matrix. 
10.1 The Geometric Point of View

As a matter of introducing certain ideas, assume that the data matrix $X$ $(n \times p)$ is
composed of $n$ observations (or individuals) of $p$ variables.

There are in fact two ways of looking at $X$, row by row or column by column:

1. Each row (observation) is a vector $x_i^\top = (x_{i1}, \ldots, x_{ip}) \in \mathbb{R}^p$.

From this point of view our data matrix $X$ is representable as a cloud of $n$
points in $\mathbb{R}^p$ as shown in Fig. 10.1.

2. Each column (variable) is a vector $x_{[j]} = (x_{1j}, \ldots, x_{nj})^\top \in \mathbb{R}^n$.

From this point of view the data matrix $X$ is a cloud of $p$ points in $\mathbb{R}^n$ as
shown in Fig. 10.2.

When $n$ and/or $p$ are large (larger than 2 or 3), we cannot produce interpretable
graphs of these clouds of points. Therefore, the aim of the factorial methods to be
developed here is twofold. We shall try to simultaneously approximate the column
space $C(X)$ and the row space $C(X^\top)$ with smaller subspaces. The hope is of course
that this can be done without losing too much information about the variation and
structure of the point clouds in both spaces. Ideally, this will provide insights into
the structure of $X$ through graphs in $\mathbb{R}$, $\mathbb{R}^2$ or $\mathbb{R}^3$. The main focus then is to find the
dimension reducing factors.


Fig. 10.1 Cloud of $n$ points in $\mathbb{R}^p$


Fig. 10.2 Cloud of $p$ points in $\mathbb{R}^n$

Summary


Each row (individual) of $X$ is a $p$-dimensional vector. From this
point of view $X$ can be considered as a cloud of $n$ points in $\mathbb{R}^p$.

Each column (variable) of $X$ is an $n$-dimensional vector. From this
point of view $X$ can be considered as a cloud of $p$ points in $\mathbb{R}^n$.


10.2  Fitting  the  p -Dimensional  Point  Cloud 

Subspaces  of  Dimension  1 

In this section $X$ is represented by a cloud of $n$ points in $\mathbb{R}^p$ (considering each row).
The question is how to project this point cloud onto a space of lower dimension. To
begin, consider the simplest problem, namely finding a subspace of dimension 1. The
problem boils down to finding a straight line $F_1$ through the origin. The direction of
this line can be defined by a unit vector $u_1 \in \mathbb{R}^p$. Hence, we are searching for the
vector $u_1$ which gives the “best” fit of the initial cloud of $n$ points. The situation is
depicted in Fig. 10.3.

The representation of the $i$th individual $x_i \in \mathbb{R}^p$ on this line is obtained by the
projection of the corresponding point onto $u_1$, i.e. the projection point $p_{x_i}$. We know
from (2.42) that the coordinate of $x_i$ on $F_1$ is given by

$$p_{x_i} = x_i^\top \frac{u_1}{\|u_1\|} = x_i^\top u_1. \tag{10.1}$$


Fig. 10.3 Projection of point cloud onto $u$ space of lower dimension
We define the best line $F_1$ in the following “least-squares” sense: Find $u_1 \in \mathbb{R}^p$
which minimises

$$\sum_{i=1}^n \|x_i - p_{x_i}\|^2. \tag{10.2}$$

Since $\|x_i - p_{x_i}\|^2 = \|x_i\|^2 - \|p_{x_i}\|^2$ by Pythagoras's theorem, the problem of
minimising (10.2) is equivalent to maximising $\sum_{i=1}^n \|p_{x_i}\|^2$. Thus the problem is
to find $u_1 \in \mathbb{R}^p$ that maximises $\sum_{i=1}^n \|p_{x_i}\|^2$ under the constraint $\|u_1\| = 1$.
With (10.1) we can write


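$$\begin{pmatrix} p_{x_1} \\ p_{x_2} \\ \vdots \\ p_{x_n} \end{pmatrix} = \begin{pmatrix} x_1^\top u_1 \\ x_2^\top u_1 \\ \vdots \\ x_n^\top u_1 \end{pmatrix} = Xu_1$$

and the problem can finally be reformulated as: find $u_1 \in \mathbb{R}^p$ with $\|u_1\| = 1$ that
maximises the quadratic form $(Xu_1)^\top (Xu_1)$ or

$$\max_{u_1^\top u_1 = 1} u_1^\top (X^\top X) u_1. \tag{10.3}$$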


The solution is given by Theorem 2.5 (using $A = X^\top X$ and $B = I$ in the
theorem).

Theorem 10.1 The vector $u_1$ which minimises (10.2) is the eigenvector of $X^\top X$
associated with the largest eigenvalue $\lambda_1$ of $X^\top X$.

Note that if the data have been centred, i.e. $\bar x = 0$, then $X = X_c$, where $X_c$ is
the centred data matrix, and $n^{-1}X^\top X$ is the covariance matrix. Thus Theorem 10.1
says that we are searching for a maximum of the quadratic form (10.3) w.r.t. the
covariance matrix $S_X = n^{-1}X^\top X$.
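Theorem 10.1 can be illustrated numerically on a small point cloud. The sketch below approximates $u_1$ by power iteration on $X^\top X$; this numerical method is a choice made here for illustration and is not taken from the text. It assumes that the largest eigenvalue is simple and that the starting vector is not orthogonal to $u_1$. The function names and the data are hypothetical.

```python
import math

def gram_matrix(X):
    """Return X'X for a data matrix given as a list of rows."""
    p = len(X[0])
    return [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]

def first_factorial_axis(X, n_iter=200):
    """Approximate the leading eigenvector u_1 and eigenvalue lambda_1 of X'X."""
    A = gram_matrix(X)
    p = len(A)
    u = [1.0] + [0.5] * (p - 1)  # arbitrary starting vector
    for _ in range(n_iter):
        w = [sum(A[a][b] * u[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        u = [v / norm for v in w]  # keep the constraint ||u|| = 1
    lam = sum(u[a] * sum(A[a][b] * u[b] for b in range(p)) for a in range(p))
    return u, lam

def sum_squared_distances(X, u):
    """Criterion (10.2): squared distances of the points to the line spanned by u."""
    total = 0.0
    for row in X:
        coord = sum(x * v for x, v in zip(row, u))  # projection coordinate x_i' u
        total += sum((x - coord * v) ** 2 for x, v in zip(row, u))
    return total

# Hypothetical centred cloud of n = 4 points in R^2.
X = [[2.0, 2.0], [-2.0, -2.0], [1.0, -1.0], [-1.0, 1.0]]
u1, lam1 = first_factorial_axis(X)
print(u1, lam1, sum_squared_distances(X, u1))
```

For these data $X^\top X$ has eigenvalues 16 and 4, so the criterion (10.2) at $u_1$ equals $\sum_i \|x_i\|^2 - \lambda_1 = 20 - 16 = 4$, in line with the Pythagoras argument above.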


Representation of the Cloud on $F_1$

The coordinates of the $n$ individuals on $F_1$ are given by $Xu_1$. $Xu_1$ is called the first
factorial variable or the first factor and $u_1$ the first factorial axis. The $n$ individuals,
$x_i$, are now represented by a new factorial variable $z_1 = Xu_1$. This factorial variable
is a linear combination of the original variables $(x_{[1]}, \ldots, x_{[p]})$ whose coefficients
are given by the vector $u_1$, i.e.

$$z_1 = u_{11}x_{[1]} + \cdots + u_{p1}x_{[p]}. \tag{10.4}$$


Fig. 10.4 Representation of the individuals $x_1, \ldots, x_n$ as a two-dimensional point cloud


Subspaces  of  Dimension  2 

If we approximate the $n$ individuals by a plane (dimension 2), it can be shown via
Theorem 2.5 that this space contains $u_1$. The plane is determined by the best linear
fit ($u_1$) and a unit vector $u_2$ orthogonal to $u_1$ which maximises the quadratic form
$u_2^\top (X^\top X)u_2$ under the constraints

$$\|u_2\| = 1 \quad \text{and} \quad u_1^\top u_2 = 0.$$

Theorem 10.2 The second factorial axis, $u_2$, is the eigenvector of $X^\top X$
corresponding to the second largest eigenvalue $\lambda_2$ of $X^\top X$.

The unit vector $u_2$ characterises a second line, $F_2$, on which the points are
projected. The coordinates of the $n$ individuals on $F_2$ are given by $z_2 = Xu_2$.
The variable $z_2$ is called the second factorial variable or the second factor. The
representation of the $n$ individuals in two-dimensional space ($z_1 = Xu_1$ vs.
$z_2 = Xu_2$) is shown in Fig. 10.4.


Subspaces of Dimension $q$ ($q \le p$)

In the case of $q$ dimensions the task is again to minimise (10.2) but with projection
points in a $q$-dimensional subspace. Following the same argument as above, it can
be shown via Theorem 2.5 that this best subspace is generated by $u_1, u_2, \ldots, u_q$, the
orthonormal eigenvectors of $X^\top X$ associated with the corresponding eigenvalues
$\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_q$. The coordinates of the $n$ individuals on the $k$th factorial
axis, $u_k$, are given by the $k$th factorial variable $z_k = Xu_k$ for $k = 1, \ldots, q$. Each
factorial variable $z_k = (z_{1k}, z_{2k}, \ldots, z_{nk})^\top$ is a linear combination of the original
variables $x_{[1]}, x_{[2]}, \ldots, x_{[p]}$ whose coefficients are given by the elements of the $k$th
vector $u_k$: $z_{ik} = \sum_{m=1}^p x_{im}u_{mk}$.


Summary 


The $p$-dimensional point cloud of individuals can be graphically
represented by projecting each element into spaces of smaller
dimensions.

The first factorial axis is $u_1$ and defines a line $F_1$ through the origin.
This line is found by minimising the orthogonal distances (10.2).
The factor $u_1$ equals the eigenvector of $X^\top X$ corresponding to its
largest eigenvalue. The coordinates for representing the point cloud
on a straight line are given by $z_1 = Xu_1$.

The second factorial axis is $u_2$, where $u_2$ denotes the eigenvector
of $X^\top X$ corresponding to its second largest eigenvalue. The
coordinates for representing the point cloud on a plane are given
by $z_1 = Xu_1$ and $z_2 = Xu_2$.

The factor directions $1, \ldots, q$ are $u_1, \ldots, u_q$, which denote the
eigenvectors of $X^\top X$ corresponding to the $q$ largest eigenvalues.
The coordinates for representing the point cloud of individuals on
a $q$-dimensional subspace are given by $z_1 = Xu_1, \ldots, z_q = Xu_q$.


10.3  Fitting  the  n -Dimensional  Point  Cloud 

Subspaces  of  Dimension  1 

Suppose that $X$ is represented by a cloud of $p$ points (variables) in $\mathbb{R}^n$ (considering
each column). How can this cloud be projected into a lower dimensional space? We
start as before with one dimension. In other words, we have to find a straight line
$G_1$, which is defined by the unit vector $v_1 \in \mathbb{R}^n$, and which gives the best fit of the
initial cloud of $p$ points.

Algebraically, this is the same problem as above (replace $X$ by $X^\top$ and follow
Sect. 10.2): the representation of the $j$th variable $x_{[j]} \in \mathbb{R}^n$ is obtained by the
projection of the corresponding point onto the straight line $G_1$ or the direction $v_1$.
Hence we have to find $v_1$ such that $\sum_{j=1}^p \|p_{x_{[j]}}\|^2$ is maximised, or equivalently, we
have to find the unit vector $v_1$ which maximises $(X^\top v_1)^\top (X^\top v_1) = v_1^\top (XX^\top)v_1$.
The solution is given by Theorem 2.5.
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Theorem  10.3  V\  is  the  eigenvector  of  corresponding  to  the  largest  eigen¬ 
value  /jL\  of  . 


Representation  of  the  Cloud  on  G\ 

The coordinates of the $p$ variables on $G_1$ are given by
$w_1 = \mathcal{X}^\top v_1$, the first factorial axis. The $p$ variables are
now represented by a linear combination of the original individuals
$x_1,\dots,x_n$, whose coefficients are given by the vector $v_1$, i.e. for
$j = 1,\dots,p$

$$w_{1j} = v_{11}x_{1j} + \dots + v_{1n}x_{nj}. \tag{10.5}$$


Subspaces of Dimension q (q ≤ n)

The representation of the $p$ variables in a subspace of dimension $q$ is done
in the same manner as for the $n$ individuals above. The best subspace is
generated by the orthonormal eigenvectors $v_1, v_2,\dots,v_q$ of
$\mathcal{X}\mathcal{X}^\top$ associated with the eigenvalues
$\mu_1 \ge \mu_2 \ge \dots \ge \mu_q$. The coordinates of the $p$ variables on
the $k$th factorial axis are given by the factorial variables
$w_k = \mathcal{X}^\top v_k$, $k = 1,\dots,q$. Each factorial variable
$w_k = (w_{k1}, w_{k2},\dots,w_{kp})^\top$ is a linear combination of the
original individuals $x_1, x_2,\dots,x_n$ whose coefficients are given by the
elements of the $k$th vector $v_k$: $w_{kj} = \sum_{m=1}^{n} v_{km}x_{mj}$.
The representation in a subspace of dimension $q = 2$ is depicted in Fig. 10.5.


Fig. 10.5 Representation of the variables $x_{[1]},\dots,x_{[p]}$ as a two-dimensional point cloud


Summary

- The $n$-dimensional point cloud of variables can be graphically
  represented by projecting each element into spaces of smaller
  dimensions.

- The first factor direction is $v_1$ and defines a line $G_1$ through the
  origin. The vector $v_1$ equals the eigenvector of
  $\mathcal{X}\mathcal{X}^\top$ corresponding to the largest eigenvalue of
  $\mathcal{X}\mathcal{X}^\top$. The coordinates for representing the point
  cloud on a straight line are $w_1 = \mathcal{X}^\top v_1$.

- The second factor direction is $v_2$, where $v_2$ denotes the eigenvector
  of $\mathcal{X}\mathcal{X}^\top$ corresponding to its second largest
  eigenvalue. The coordinates for representing the point cloud on a plane
  are given by $w_1 = \mathcal{X}^\top v_1$ and $w_2 = \mathcal{X}^\top v_2$.

- The factor directions $1,\dots,q$ are $v_1,\dots,v_q$, which denote the
  eigenvectors of $\mathcal{X}\mathcal{X}^\top$ corresponding to the $q$
  largest eigenvalues. The coordinates for representing the point cloud of
  variables on a $q$-dimensional subspace are given by
  $w_1 = \mathcal{X}^\top v_1,\dots,w_q = \mathcal{X}^\top v_q$.


10.4  Relations  Between  Subspaces 

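The aim of this section is to present a duality relationship between the two
approaches shown in Sects. 10.2 and 10.3. Consider the eigenvector equations
in $\mathbb{R}^n$

$$(\mathcal{X}\mathcal{X}^\top)v_k = \mu_k v_k \tag{10.6}$$

for $k \le r$, where
$r = \operatorname{rank}(\mathcal{X}\mathcal{X}^\top) = \operatorname{rank}(\mathcal{X}) \le \min(p,n)$.
Multiplying by $\mathcal{X}^\top$, we have

$$\mathcal{X}^\top(\mathcal{X}\mathcal{X}^\top)v_k = \mu_k\mathcal{X}^\top v_k \tag{10.7}$$

or

$$(\mathcal{X}^\top\mathcal{X})(\mathcal{X}^\top v_k) = \mu_k(\mathcal{X}^\top v_k) \tag{10.8}$$

so that each eigenvector $v_k$ of $\mathcal{X}\mathcal{X}^\top$ corresponds to
an eigenvector $(\mathcal{X}^\top v_k)$ of $\mathcal{X}^\top\mathcal{X}$
associated with the same eigenvalue $\mu_k$. This means that every nonzero
eigenvalue of $\mathcal{X}\mathcal{X}^\top$ is an eigenvalue of
$\mathcal{X}^\top\mathcal{X}$. The corresponding eigenvectors are related by

$$u_k = c_k\mathcal{X}^\top v_k,$$

where $c_k$ is some constant.

Now consider the eigenvector equations in $\mathbb{R}^p$:

$$(\mathcal{X}^\top\mathcal{X})u_k = \lambda_k u_k \tag{10.9}$$

for $k \le r$. Multiplying by $\mathcal{X}$, we have

$$(\mathcal{X}\mathcal{X}^\top)(\mathcal{X}u_k) = \lambda_k(\mathcal{X}u_k), \tag{10.10}$$

i.e. each eigenvector $u_k$ of $\mathcal{X}^\top\mathcal{X}$ corresponds to an
eigenvector $\mathcal{X}u_k$ of $\mathcal{X}\mathcal{X}^\top$ associated with
the same eigenvalue $\lambda_k$. Therefore, every nonzero eigenvalue of
$\mathcal{X}^\top\mathcal{X}$ is an eigenvalue of $\mathcal{X}\mathcal{X}^\top$.
The corresponding eigenvectors are related by

$$v_k = d_k\mathcal{X}u_k,$$

where $d_k$ is some constant. Now, since $u_k^\top u_k = v_k^\top v_k = 1$, we
have $1 = v_k^\top v_k = d_k^2\,u_k^\top\mathcal{X}^\top\mathcal{X}u_k = d_k^2\lambda_k$,
and in the same way $1 = c_k^2\lambda_k$, so that
$c_k = d_k = 1/\sqrt{\lambda_k}$. This leads to the following result: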

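Theorem 10.4 (Duality Relations) Let $r$ be the rank of $\mathcal{X}$. For
$k \le r$, the eigenvalues $\lambda_k$ of $\mathcal{X}^\top\mathcal{X}$ and
$\mathcal{X}\mathcal{X}^\top$ are the same and the eigenvectors ($u_k$ and
$v_k$, respectively) are related by

$$u_k = \frac{1}{\sqrt{\lambda_k}}\,\mathcal{X}^\top v_k \tag{10.11}$$

$$v_k = \frac{1}{\sqrt{\lambda_k}}\,\mathcal{X}u_k. \tag{10.12}$$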

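Note that the projection of the $p$ variables on the factorial axis $v_k$ is
given by

$$w_k = \mathcal{X}^\top v_k = \frac{1}{\sqrt{\lambda_k}}\,\mathcal{X}^\top\mathcal{X}u_k = \sqrt{\lambda_k}\,u_k. \tag{10.13}$$

Therefore, the eigenvectors $v_k$ do not have to be explicitly recomputed to
get $w_k$. Note that $u_k$ and $v_k$ provide the SVD of $\mathcal{X}$ (see
Theorem 2.2). Letting $U = [u_1\ u_2\ \dots\ u_r]$, $V = [v_1\ v_2\ \dots\ v_r]$
and $\Lambda = \operatorname{diag}(\lambda_1,\dots,\lambda_r)$ we have

$$\mathcal{X} = V\Lambda^{1/2}U^\top$$

so that

$$x_{ij} = \sum_{k=1}^{r} \lambda_k^{1/2}\,v_{ik}\,u_{jk}. \tag{10.14}$$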

In  the  following  section  this  method  is  applied  in  analysing  consumption 
behaviour  across  different  household  types. 
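
The duality relations can be checked numerically. The following Python/NumPy sketch is only an illustration and is not one of the quantlets of the book: the random data matrix, the seed and all variable names are assumptions. It builds $v_k$ from $u_k$ via (10.12) and verifies (10.10), (10.13) and the SVD form (10.14).

```python
import numpy as np

rng = np.random.default_rng(0)           # arbitrary seed (assumption)
X = rng.normal(size=(8, 3))              # hypothetical data: n = 8 individuals, p = 3 variables

# Eigen-decomposition of X^T X (p x p); eigh returns ascending eigenvalues
lam, U = np.linalg.eigh(X.T @ X)
order = np.argsort(lam)[::-1]            # reorder so that lambda_1 >= lambda_2 >= ...
lam, U = lam[order], U[:, order]

# Duality relation (10.12): v_k = X u_k / sqrt(lambda_k), one column per k
V = X @ U / np.sqrt(lam)

# (10.10): each v_k is an eigenvector of X X^T with the same eigenvalue
duality_residual = np.abs((X @ X.T) @ V - V * lam).max()
# the v_k have unit length and are mutually orthogonal
orthonormal_residual = np.abs(V.T @ V - np.eye(3)).max()

# (10.13): w_k = X^T v_k equals sqrt(lambda_k) u_k
W = X.T @ V
w_residual = np.abs(W - U * np.sqrt(lam)).max()

# (10.14): X = V Lambda^{1/2} U^T
svd_residual = np.abs(V @ np.diag(np.sqrt(lam)) @ U.T - X).max()

print(duality_residual, orthonormal_residual, w_residual, svd_residual)
```

All four residuals should be of the order of machine precision. Only the smaller $p \times p$ eigenproblem is solved; the $v_k$ follow from the $u_k$ without a second decomposition.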
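Summary

- The matrices $\mathcal{X}^\top\mathcal{X}$ and $\mathcal{X}\mathcal{X}^\top$
  have the same nonzero eigenvalues $\lambda_1,\dots,\lambda_r$, where
  $r = \operatorname{rank}(\mathcal{X})$.

- The eigenvectors of $\mathcal{X}^\top\mathcal{X}$ can be calculated from the
  eigenvectors of $\mathcal{X}\mathcal{X}^\top$ and vice versa:
  $u_k = \frac{1}{\sqrt{\lambda_k}}\,\mathcal{X}^\top v_k$ and
  $v_k = \frac{1}{\sqrt{\lambda_k}}\,\mathcal{X}u_k$.

- The coordinates representing the variables (columns) of $\mathcal{X}$ in a
  $q$-dimensional subspace can be easily calculated by
  $w_k = \sqrt{\lambda_k}\,u_k$.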


10.5  Practical  Computation 

The practical implementation of the techniques introduced begins with the
computation of the eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p$
and the corresponding eigenvectors $u_1,\dots,u_p$ of
$\mathcal{X}^\top\mathcal{X}$. (Since $p$ is usually less than $n$, this is
numerically less involved than computing $v_k$ directly for $k = 1,\dots,p$.)
The representation of the $n$ individuals on a plane is then obtained by
plotting $z_1 = \mathcal{X}u_1$ versus $z_2 = \mathcal{X}u_2$
($z_3 = \mathcal{X}u_3$ may eventually be added if a third dimension is
helpful). Using the Duality Relation (10.13) representations for the $p$
variables can easily be obtained. These representations can be visualised in a
scatterplot of $w_1 = \sqrt{\lambda_1}\,u_1$ against
$w_2 = \sqrt{\lambda_2}\,u_2$ (and eventually against
$w_3 = \sqrt{\lambda_3}\,u_3$). Higher dimensional factorial resolutions can be
obtained (by computing $z_k$ and $w_k$ for $k > 3$) but, of course, cannot be
plotted.

A standard way of evaluating the quality of the factorial representations in a
subspace of dimension $q$ is given by the ratio

$$\tau_q = \frac{\lambda_1 + \lambda_2 + \dots + \lambda_q}{\lambda_1 + \lambda_2 + \dots + \lambda_p} \tag{10.15}$$

where $0 \le \tau_q \le 1$. In general, the scalar product $y^\top y$ is called
the inertia of $y \in \mathbb{R}^n$ w.r.t. the origin. Therefore, the ratio
$\tau_q$ is usually interpreted as the percentage of the inertia explained by
the first $q$ factors. Note that
$\lambda_j = (\mathcal{X}u_j)^\top(\mathcal{X}u_j) = z_j^\top z_j$. Thus,
$\lambda_j$ is the inertia of the $j$th factorial variable w.r.t. the origin.
The denominator in (10.15) is a measure of the total inertia of the $p$
variables, $x_{[j]}$. Indeed, by (2.3)

$$\sum_{j=1}^{p} \lambda_j = \operatorname{tr}(\mathcal{X}^\top\mathcal{X}) = \sum_{j=1}^{p}\sum_{i=1}^{n} x_{ij}^2 = \sum_{j=1}^{p} x_{[j]}^\top x_{[j]}.$$

Remark 10.1 It is clear that the sum $\sum_{j=1}^{q} \lambda_j$ is the sum of
the inertia of the first $q$ factorial variables $z_1, z_2,\dots,z_q$.
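
The computational recipe of this section can be sketched in a few lines. The Python/NumPy code below is an illustration under stated assumptions (random data, arbitrary seed, hypothetical variable names) and does not reproduce any quantlet of the book. It computes the factorial variables $z_k$ and $w_k$ from the eigen-decomposition of $\mathcal{X}^\top\mathcal{X}$ alone and checks that $\lambda_j = z_j^\top z_j$.

```python
import numpy as np

rng = np.random.default_rng(1)           # arbitrary seed (assumption)
X = rng.normal(size=(10, 4))             # hypothetical data: n = 10, p = 4

# Eigenvalues and eigenvectors of X^T X, sorted in decreasing order
lam, U = np.linalg.eigh(X.T @ X)
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]

Z = X @ U                                # columns z_k = X u_k (individuals)
W = U * np.sqrt(lam)                     # columns w_k = sqrt(lambda_k) u_k (variables)

inertia = (Z ** 2).sum(axis=0)           # z_j^T z_j for each factor j
total_inertia = (X ** 2).sum()           # tr(X^T X), the denominator of (10.15)
tau = np.cumsum(lam) / lam.sum()         # tau_1, ..., tau_p

print(np.round(tau, 3))
```

Plotting the first two columns of `Z` against each other gives the plane for the individuals, and the first two columns of `W` give the plane for the variables.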

Example  10.1  We  consider  the  data  set  in  Table  22.6  which  gives  the  food 
expenditures  of  various  French  families  (manual  workers  =  MA,  employees  =  EM, 
managers  =  CA)  with  varying  numbers  of  children  (2,  3,  4  or  5  children).  We  are 
interested  in  investigating  whether  certain  household  types  prefer  certain  food  types. 
We  can  answer  this  question  using  the  factorial  approximations  developed  here. 
The  correlation  matrix  corresponding  to  the  data  is 


$$
\mathcal{R} = \begin{pmatrix}
1.00 & 0.59 & 0.20 & 0.32 & 0.25 & 0.86 & 0.30 \\
0.59 & 1.00 & 0.86 & 0.88 & 0.83 & 0.66 & -0.36 \\
0.20 & 0.86 & 1.00 & \dots & \dots & 0.33 & -0.49 \\
0.32 & 0.88 & \dots & 1.00 & 0.98 & 0.37 & -0.44 \\
0.25 & 0.83 & \dots & 0.98 & 1.00 & 0.23 & -0.40 \\
0.86 & 0.66 & 0.33 & 0.37 & 0.23 & 1.00 & 0.01 \\
0.30 & -0.36 & -0.49 & -0.44 & -0.40 & 0.01 & 1.00
\end{pmatrix}
$$

We  observe  a  rather  high  correlation  (0.98)  between  meat  and  poultry,  whereas 
the  correlation  for  expenditure  for  milk  and  wine  (0.01)  is  rather  small.  Are  there 
household  types  that  prefer,  say,  meat  over  bread? 

We  shall  now  represent  food  expenditures  and  households  simultaneously  using 
two  factors.  First,  note  that  in  this  particular  problem  the  origin  has  no  specific 
meaning  (it  represents  a  “zero”  consumer).  So  it  makes  sense  to  compare  the 
consumption  of  any  family  to  that  of  an  “average  family”  rather  than  to  the  origin. 
Therefore, the data is first centred (the origin is translated to the centre
of gravity, $\bar{x}$). Furthermore, since the dispersions of the seven
variables are quite different, each variable is standardised so that each has
the same weight in the analysis (mean 0 and variance 1). Finally, for
convenience, we divide each element in the matrix by $\sqrt{n} = \sqrt{12}$.
(This will only change the scaling of the plots in the graphical
representation.)

The data matrix to be analysed is

$$\mathcal{X}_* = \frac{1}{\sqrt{n}}\,\mathcal{H}\mathcal{X}\mathcal{D}^{-1/2},$$

where $\mathcal{H}$ is the centering matrix and
$\mathcal{D} = \operatorname{diag}(s_{X_iX_i})$ (see Sect. 3.3). Note that by
standardising by $\sqrt{n}$, it follows that
$\mathcal{X}_*^\top\mathcal{X}_* = \mathcal{R}$ where $\mathcal{R}$ is the
correlation matrix of the original data. Calculating

$$\lambda = (4.33, 1.83, 0.63, 0.13, 0.06, 0.02, 0.00)^\top$$

shows that the directions of the first two eigenvectors play a dominant role
($\tau_2 = 88\,\%$), whereas the other directions contribute less than 15 % of
inertia. A two-dimensional plot should suffice for interpreting this data set.
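
The value of $\tau_2$ quoted above follows directly from (10.15) and the printed eigenvalues. A minimal Python sketch (the helper name `tau` is an assumption introduced only for this illustration) is:

```python
# Eigenvalues as printed in Example 10.1 (rounded to two decimals)
eigenvalues = [4.33, 1.83, 0.63, 0.13, 0.06, 0.02, 0.00]

def tau(q, lam):
    """Share of inertia explained by the first q factors, Eq. (10.15)."""
    return sum(lam[:q]) / sum(lam)

tau_2 = tau(2, eigenvalues)
remaining = 1.0 - tau_2                  # share left to factors 3, ..., 7

print(round(tau_2, 2), round(remaining, 2))
```

Since the eigenvalues are rounded, the result is accurate only to about two decimals.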
Fig. 10.6 Representation of food expenditures and family types in two dimensions. Q MVAdecofood


The coordinates of the projected data points are given in the two lower
windows of Fig. 10.6. Let us first examine the food expenditure window. In
this window we see the representation of the $p = 7$ variables given by the
first two factors. The plot shows the factorial variables $w_1$ and $w_2$ in
the same fashion as Fig. 10.4. We see that the points for meat, poultry,
vegetables and fruits are close to each other in the lower left of the graph.
The expenditures for bread and milk can be found in the upper left, whereas
wine stands alone in the upper right. The first factor, $w_1$, may be
interpreted as the meat/fruit factor of consumption, the second factor, $w_2$,
as the bread/wine component.

In the lower window on the right-hand side, we show the factorial variables
$z_1$ and $z_2$ from the fit of the $n = 12$ household types. Note that by the
Duality Relations of Theorem 10.4, the factorial variables $z_j$ are linear
combinations of the factors $w_k$ from the left window. The points displayed
in the consumer window (graph on the right) are plotted relative to an average
consumer represented by the origin. The manager families are located in the
lower left corner of the graph whereas the manual workers and employees tend
to be in the upper right. The factorial variables for CA5 (managers with five
children) lie close to the meat/fruit factor. Relative to the average consumer
this household type is a large consumer of meat/poultry and fruits/vegetables.
In Chap. 11, we will return to these plots interpreting them in a much deeper
way. At this stage, it suffices to notice that the plots provide a graphical
representation in $\mathbb{R}^2$ of the information contained in the original,
high-dimensional ($12 \times 7$) data matrix.


Summary

- The practical implementation of factor decomposition of matrices consists
  of computing the eigenvalues $\lambda_1,\dots,\lambda_p$ and the
  eigenvectors $u_1,\dots,u_p$ of $\mathcal{X}^\top\mathcal{X}$. The
  representation of the $n$ individuals is obtained by plotting
  $z_1 = \mathcal{X}u_1$ vs. $z_2 = \mathcal{X}u_2$ (and, if necessary, vs.
  $z_3 = \mathcal{X}u_3$). The representation of the $p$ variables is obtained
  by plotting $w_1 = \sqrt{\lambda_1}\,u_1$ vs. $w_2 = \sqrt{\lambda_2}\,u_2$
  (and, if necessary, vs. $w_3 = \sqrt{\lambda_3}\,u_3$).

- The quality of the factorial representation can be evaluated using $\tau_q$
  which is the percentage of inertia explained by the first $q$ factors.


10.6  Exercises 

Exercise 10.1 Prove that $n^{-1}\mathcal{Z}^\top\mathcal{Z}$ is the covariance
of the centred data matrix, where $\mathcal{Z}$ is the matrix formed by the
columns $z_k = \mathcal{X}u_k$.

Exercise 10.2 Compute the SVD of the French food data (Table 22.6).

Exercise 10.3 Compute $\tau_3, \tau_4, \dots$ for the French food data (Table 22.6).

Exercise 10.4 Apply the factorial techniques to the Swiss bank notes (Sect. 22.2).

Exercise 10.5 Apply the factorial techniques to the time budget data (Table 22.14).

Exercise 10.6 Assume that you wish to analyse $p$ independent identically
distributed random variables. What is the percentage of the inertia explained
by the first factor? What is the percentage of the inertia explained by the
first $q$ factors?

Exercise 10.7 Assume that you have $p$ i.i.d. r.v.'s. What does the
eigenvector corresponding to the first factor look like?

Exercise 10.8 Assume that you have two random variables, $X_1$ and
$X_2 = 2X_1$. What do the eigenvalues and eigenvectors of their correlation
matrix look like? How many eigenvalues are nonzero?

Exercise 10.9 What percentage of inertia is explained by the first factor in
the previous exercise?

Exercise  10.10  How  do  the  eigenvalues  and  eigenvectors  in  Example  10.1  change 
if  we  take  the  prices  in  USD  instead  of  in  EUR?  Does  it  make  a  difference  if  some 
of  the  prices  are  in  EUR  and  others  in  USD? 


Chapter  11 

Principal  Components  Analysis 


Chapter 10 presented the basic geometric tools needed to produce a lower
dimensional description of the rows and columns of a multivariate data matrix.
Principal components analysis (PCA) has the same objective with the exception
that the rows of the data matrix $\mathcal{X}$ will now be considered as
observations from a $p$-variate random variable $X$. The principal idea of
reducing the dimension of $X$ is achieved through linear combinations. Low
dimensional linear combinations are often easier to interpret and serve as an
intermediate step in a more complex data analysis. More precisely one looks
for linear combinations which create the largest spread among the values of
$X$. In other words, one is searching for linear combinations with the largest
variances.

Section 11.1 introduces the basic ideas and technical elements behind
principal components. No particular assumption will be made on $X$ except that
the mean vector and the covariance matrix exist. When reference is made to a
data matrix $\mathcal{X}$ in Sect. 11.2, the empirical mean and covariance
matrix will be used. Section 11.3 shows how to interpret the principal
components by studying their correlations with the original components of $X$.
Often analyses are performed in practice by looking at two-dimensional
scatterplots. Section 11.4 develops inference techniques on principal
components. This is particularly helpful in establishing the appropriate
dimension reduction and thus in determining the quality of the resulting lower
dimensional representations. Since principal component analysis is performed
on covariance matrices, it is not scale invariant. Often, the measurement
units of the components of $X$ are quite different, so it is reasonable to
standardise the measurement units. The normalised version of principal
components is defined in Sect. 11.5. In Sect. 11.6 it is discovered that the
empirical principal components are the factors of appropriate transformations
of the data matrix. The classical way of defining principal components through
linear combinations with respect to the largest variance is described here in
geometric terms, i.e. in terms of the optimal fit within subspaces generated
by the columns and/or the rows of $\mathcal{X}$ as was discussed in Chap. 10.
Section 11.9 concludes with additional examples.


11.1  Standardised  Linear  Combination 

The main objective of PCA is to reduce the dimension of the observations. The
simplest way of dimension reduction is to take just one element of the
observed vector and to discard all others. This is not a very reasonable
approach, as we have seen in the earlier chapters, since strength may be lost
in interpreting the data. In the bank notes example we have seen that just one
variable (e.g. $X_1$ = length) had no discriminatory power in distinguishing
counterfeit from genuine bank notes. An alternative method is to weight all
variables equally, i.e. to consider the simple average
$p^{-1}\sum_{j=1}^{p} X_j$ of all the elements in the vector
$X = (X_1,\dots,X_p)^\top$. This again is undesirable, since all of the
elements of $X$ are considered with equal importance (weight).

A more flexible approach is to study a weighted average, namely

$$\delta^\top X = \sum_{j=1}^{p} \delta_j X_j, \quad \text{such that} \quad \sum_{j=1}^{p} \delta_j^2 = 1. \tag{11.1}$$

The weighting vector $\delta = (\delta_1,\dots,\delta_p)^\top$ can then be
optimised to investigate and to detect specific features. We call (11.1) a
standardised linear combination (SLC). Which SLC should we choose? One aim is
to maximise the variance of the projection $\delta^\top X$, i.e. to choose
$\delta$ according to

$$\max_{\{\delta:\|\delta\|=1\}} \operatorname{Var}(\delta^\top X) = \max_{\{\delta:\|\delta\|=1\}} \delta^\top \operatorname{Var}(X)\,\delta. \tag{11.2}$$

The interesting "directions" of $\delta$ are found through the spectral
decomposition of the covariance matrix. Indeed, from Theorem 2.5, the
direction $\delta$ is given by the eigenvector $\gamma_1$ corresponding to the
largest eigenvalue $\lambda_1$ of the covariance matrix
$\Sigma = \operatorname{Var}(X)$.

Figures  11.1  and  11.2  show  two  such  projections  (SLCs)  of  the  same  data  set 
with  zero  mean.  In  Fig.  11.1  an  arbitrary  projection  is  displayed.  The  upper  window 
shows  the  data  point  cloud  and  the  line  onto  which  the  data  are  projected.  The 
middle  window  shows  the  projected  values  in  the  selected  direction.  The  lower 
window  shows  the  variance  of  the  actual  projection  and  the  percentage  of  the  total 
variance  that  is  explained. 

Figure 11.2 shows the projection that captures the majority of the variance in
the data. This direction is of interest and is located along the main
direction of the point cloud. The same line of thought can be applied to all
data orthogonal to this direction leading to the second eigenvector. The SLC
with the highest variance obtained from maximising (11.2) is the first
principal component (PC) $Y_1 = \gamma_1^\top X$. Orthogonal to the direction
$\gamma_1$ we find the SLC with the second highest variance:
$Y_2 = \gamma_2^\top X$, the second PC.
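
The maximisation in (11.2) can be illustrated numerically in two dimensions. The sketch below (Python/NumPy; the covariance matrix and the grid of directions are assumptions chosen for illustration, not data from the book) evaluates $\delta^\top\Sigma\delta$ over many unit vectors and compares the best one with the eigenvector of the largest eigenvalue.

```python
import numpy as np

Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])           # hypothetical covariance matrix (eigenvalues 2.8 and 0.2)

lam, Gamma = np.linalg.eigh(Sigma)       # ascending eigenvalues
gamma_1 = Gamma[:, -1]                   # eigenvector of the largest eigenvalue
lam_1 = lam[-1]

# Unit vectors delta = (cos a, sin a) on a fine grid of angles
angles = np.linspace(0.0, np.pi, 1801)
deltas = np.column_stack([np.cos(angles), np.sin(angles)])

# Var(delta^T X) = delta^T Sigma delta for every direction on the grid
variances = np.einsum("ij,jk,ik->i", deltas, Sigma, deltas)

best = deltas[np.argmax(variances)]      # grid direction with the largest variance
print(variances.max(), lam_1, abs(best @ gamma_1))
```

No direction on the grid exceeds $\lambda_1$, and the best grid direction agrees with $\gamma_1$ up to sign and grid resolution.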
Fig. 11.1 An arbitrary SLC. Q MVApcasimu

Fig. 11.2 The most interesting SLC. Q MVApcasimu


Proceeding in this way and writing in matrix notation, the result for a random
variable $X$ with $\operatorname{E}(X) = \mu$ and
$\operatorname{Var}(X) = \Sigma = \Gamma\Lambda\Gamma^\top$ is the PC
transformation which is defined as

$$Y = \Gamma^\top(X - \mu). \tag{11.3}$$

Here we have centred the variable $X$ in order to obtain a zero mean PC
variable $Y$.
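Example 11.1 Consider a bivariate normal distribution $N(0, \Sigma)$ with
$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$ and $\rho > 0$
(see Example 3.13). Recall that the eigenvalues of this matrix are
$\lambda_1 = 1 + \rho$ and $\lambda_2 = 1 - \rho$ with corresponding
eigenvectors

$$\gamma_1 = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad \gamma_2 = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ -1 \end{pmatrix}.$$

The PC transformation is thus

$$Y = \Gamma^\top(X - \mu) = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} X$$

or

$$\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = \frac{1}{\sqrt{2}}\begin{pmatrix} X_1 + X_2 \\ X_1 - X_2 \end{pmatrix}.$$

So the first principal component is

$$Y_1 = \frac{1}{\sqrt{2}}(X_1 + X_2)$$

and the second is

$$Y_2 = \frac{1}{\sqrt{2}}(X_1 - X_2).$$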

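Let us compute the variances of these PCs using formulas (4.22)-(4.26):

$$
\begin{aligned}
\operatorname{Var}(Y_1) &= \operatorname{Var}\left\{\frac{1}{\sqrt{2}}(X_1 + X_2)\right\} = \frac{1}{2}\operatorname{Var}(X_1 + X_2) \\
&= \frac{1}{2}\{\operatorname{Var}(X_1) + \operatorname{Var}(X_2) + 2\operatorname{Cov}(X_1, X_2)\} \\
&= \frac{1}{2}(1 + 1 + 2\rho) = 1 + \rho \\
&= \lambda_1.
\end{aligned}
$$

Similarly we find that

$$\operatorname{Var}(Y_2) = \lambda_2.$$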
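
The calculation in Example 11.1 can be reproduced numerically. In the sketch below (Python/NumPy) the value $\rho = 0.6$ is an arbitrary assumption; any $0 < \rho < 1$ would serve equally well. The covariance matrix of $Y = \Gamma^\top(X - \mu)$ is $\Gamma^\top\Sigma\Gamma$.

```python
import numpy as np

rho = 0.6                                # assumed value, any 0 < rho < 1 works
Sigma = np.array([[1.0, rho],
                  [rho, 1.0]])

# Columns are gamma_1 = (1, 1)/sqrt(2) and gamma_2 = (1, -1)/sqrt(2)
Gamma = np.array([[1.0,  1.0],
                  [1.0, -1.0]]) / np.sqrt(2.0)

# Covariance matrix of the PCs: Gamma^T Sigma Gamma
var_Y = Gamma.T @ Sigma @ Gamma

print(np.round(var_Y, 10))               # diagonal holds 1 + rho and 1 - rho
```

The off-diagonal entries vanish, so $Y_1$ and $Y_2$ are uncorrelated with variances $1 + \rho$ and $1 - \rho$.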


This  can  be  expressed  more  generally  and  is  given  in  the  next  theorem. 
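Theorem 11.1 For a given $X \sim (\mu, \Sigma)$ let
$Y = \Gamma^\top(X - \mu)$ be the PC transformation. Then

$$\operatorname{E} Y_j = 0, \quad j = 1,\dots,p \tag{11.4}$$

$$\operatorname{Var}(Y_j) = \lambda_j, \quad j = 1,\dots,p \tag{11.5}$$

$$\operatorname{Cov}(Y_i, Y_j) = 0, \quad i \ne j \tag{11.6}$$

$$\operatorname{Var}(Y_1) \ge \operatorname{Var}(Y_2) \ge \dots \ge \operatorname{Var}(Y_p) \ge 0 \tag{11.7}$$

$$\sum_{j=1}^{p} \operatorname{Var}(Y_j) = \operatorname{tr}(\Sigma) \tag{11.8}$$

$$\prod_{j=1}^{p} \operatorname{Var}(Y_j) = |\Sigma|. \tag{11.9}$$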

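Proof To prove (11.6), we use $\gamma_i$ to denote the $i$th column of
$\Gamma$. Then

$$\operatorname{Cov}(Y_i, Y_j) = \gamma_i^\top \operatorname{Var}(X - \mu)\,\gamma_j = \gamma_i^\top \operatorname{Var}(X)\,\gamma_j.$$

As $\operatorname{Var}(X) = \Sigma = \Gamma\Lambda\Gamma^\top$,
$\Gamma^\top\Gamma = \mathcal{I}$, we obtain via the orthogonality of $\Gamma$:

$$\gamma_i^\top\Gamma\Lambda\Gamma^\top\gamma_j = \begin{cases} 0, & i \ne j, \\ \lambda_i, & i = j. \end{cases}$$

In fact, as $Y_i = \gamma_i^\top(X - \mu)$ lies in the eigenvector space
corresponding to $\gamma_i$, and eigenvector spaces corresponding to different
eigenvalues are orthogonal to each other, we can directly see $Y_i$ and $Y_j$
are orthogonal to each other, so their covariance is 0. □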

The connection between the PC transformation and the search for the best SLC
is made in the following theorem, which follows directly from (11.2) and
Theorem 2.5.

Theorem 11.2 There exists no SLC that has larger variance than
$\lambda_1 = \operatorname{Var}(Y_1)$.

Theorem 11.3 If $Y = a^\top X$ is an SLC that is not correlated with the first
$k$ PCs of $X$, then the variance of $Y$ is maximised by choosing it to be the
$(k+1)$-st PC.
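
Properties (11.5)-(11.9) and Theorem 11.2 are easy to check for a concrete covariance matrix. The following Python/NumPy sketch uses a randomly generated positive semi-definite matrix; the matrix, the seed and the number of random directions are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)           # arbitrary seed (assumption)
A = rng.normal(size=(4, 4))
Sigma = A @ A.T                          # hypothetical covariance matrix, p = 4

lam, Gamma = np.linalg.eigh(Sigma)
lam, Gamma = lam[::-1], Gamma[:, ::-1]   # lambda_1 >= ... >= lambda_p

# Covariance matrix of Y = Gamma^T (X - mu)
cov_Y = Gamma.T @ Sigma @ Gamma

# Theorem 11.2: random SLCs never have larger variance than lambda_1
deltas = rng.normal(size=(1000, 4))
deltas /= np.linalg.norm(deltas, axis=1, keepdims=True)   # normalise to length 1
slc_variances = np.einsum("ij,jk,ik->i", deltas, Sigma, deltas)

print(np.round(np.diag(cov_Y), 4), slc_variances.max() <= lam[0])
```

The matrix `cov_Y` is diagonal with the eigenvalues on its diagonal, their sum equals the trace of `Sigma` and their product equals its determinant.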


Summary

- An SLC is a weighted average
  $\delta^\top X = \sum_{j=1}^{p} \delta_j X_j$ where $\delta$ is a vector of
  length 1.

- Maximising the variance of $\delta^\top X$ leads to the choice
  $\delta = \gamma_1$, the eigenvector corresponding to the largest eigenvalue
  $\lambda_1$ of $\Sigma = \operatorname{Var}(X)$.

- This is a projection of $X$ into the one-dimensional space, where the
  components of $X$ are weighted by the elements of $\gamma_1$.
  $Y_1 = \gamma_1^\top(X - \mu)$ is called the first principal component (PC).

- This projection can be generalised for higher dimensions. The PC
  transformation is the linear transformation $Y = \Gamma^\top(X - \mu)$,
  where $\Sigma = \operatorname{Var}(X) = \Gamma\Lambda\Gamma^\top$ and
  $\mu = \operatorname{E} X$. $Y_1, Y_2,\dots,Y_p$ are called the first,
  second, ..., and $p$-th PCs.

- The PCs have zero means, variance $\operatorname{Var}(Y_j) = \lambda_j$, and
  zero covariances. From $\lambda_1 \ge \dots \ge \lambda_p$ it follows that
  $\operatorname{Var}(Y_1) \ge \dots \ge \operatorname{Var}(Y_p)$. It holds
  that $\sum_{j=1}^{p} \operatorname{Var}(Y_j) = \operatorname{tr}(\Sigma)$
  and $\prod_{j=1}^{p} \operatorname{Var}(Y_j) = |\Sigma|$.

- If $Y = a^\top X$ is an SLC which is not correlated with the first $k$ PCs
  of $X$, then the variance of $Y$ is maximised by choosing it to be the
  $(k+1)$-st PC.


11.2  Principal  Components  in  Practice 

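In practice the PC transformation has to be replaced by the respective
estimators: $\mu$ becomes $\bar{x}$, $\Sigma$ is replaced by $\mathcal{S}$,
etc. If $g_1$ denotes the first eigenvector of $\mathcal{S}$, the first
principal component is given by $y_1 = (\mathcal{X} - 1_n\bar{x}^\top)g_1$.
More generally if $\mathcal{S} = \mathcal{G}\mathcal{L}\mathcal{G}^\top$ is the
spectral decomposition of $\mathcal{S}$, then the PCs are obtained by

$$\mathcal{Y} = (\mathcal{X} - 1_n\bar{x}^\top)\mathcal{G}. \tag{11.10}$$

Note that with the centering matrix
$\mathcal{H} = \mathcal{I} - (n^{-1}1_n 1_n^\top)$ and
$\mathcal{H}1_n\bar{x}^\top = 0$ we can write

$$
\begin{aligned}
\mathcal{S}_{\mathcal{Y}} &= n^{-1}\mathcal{Y}^\top\mathcal{H}\mathcal{Y} = n^{-1}\mathcal{G}^\top(\mathcal{X} - 1_n\bar{x}^\top)^\top\mathcal{H}(\mathcal{X} - 1_n\bar{x}^\top)\mathcal{G} \\
&= n^{-1}\mathcal{G}^\top\mathcal{X}^\top\mathcal{H}\mathcal{X}\mathcal{G} = \mathcal{G}^\top\mathcal{S}\mathcal{G} = \mathcal{L}
\end{aligned}
\tag{11.11}
$$

where $\mathcal{L} = \operatorname{diag}(\ell_1,\dots,\ell_p)$ is the matrix
of eigenvalues of $\mathcal{S}$. Hence the variance of $y_i$ equals the
eigenvalue $\ell_i$!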
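
The empirical PC transformation (11.10) and the identity (11.11) can be sketched as follows. The Python/NumPy code uses simulated data rather than the bank notes, and the mixing matrix, seed and variable names are assumptions. The covariance uses the divisor $n$, as in (11.11).

```python
import numpy as np

rng = np.random.default_rng(3)           # arbitrary seed (assumption)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.5]])
X = rng.normal(size=(50, 3)) @ mixing    # hypothetical data: n = 50, p = 3
n = X.shape[0]

x_bar = X.mean(axis=0)
Xc = X - x_bar                           # X - 1_n x_bar^T
S = Xc.T @ Xc / n                        # empirical covariance with divisor n

ell, G = np.linalg.eigh(S)
ell, G = ell[::-1], G[:, ::-1]           # ell_1 >= ... >= ell_p

Y = Xc @ G                               # PC scores, Eq. (11.10)
S_Y = Y.T @ Y / n                        # columns of Y already have mean zero

print(np.round(S_Y, 6))
print(np.round(ell, 6))
```

The matrix `S_Y` is diagonal and its diagonal reproduces the eigenvalues `ell`, which is the statement of (11.11).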

The PC technique is sensitive to scale changes. If we multiply one variable by
a scalar we obtain different eigenvalues and eigenvectors. This is due to the
fact that an eigenvalue decomposition is performed on the covariance matrix
and not on the correlation matrix (see Sect. 11.5). The following warning is
therefore important:

The PC transformation should be applied to data that have approximately the
same scale in each variable.

Fig. 11.3 Principal components of the bank data (panel title: First vs. Second PC). Q MVApcabank


Example  11.2  Let  us  apply  this  technique  to  the  bank  data  set.  In  this  example  we 
do  not  standardise  the  data.  Figure  11.3  shows  some  PC  plots  of  the  bank  data  set. 
The  genuine  and  counterfeit  bank  notes  are  marked  by  “o”  and  respectively. 
Recall that the mean vector of $\mathcal{X}$ is

$$\bar{x} = (214.9, 130.1, 129.9, 9.4, 10.6, 140.5)^\top.$$

The vector of eigenvalues of $\mathcal{S}$ is

$$\ell = (2.985, 0.931, 0.242, 0.194, 0.085, 0.035)^\top.$$

The eigenvectors $g_j$ are given by the columns of the matrix

$$
\mathcal{G} = \begin{pmatrix}
-0.044 & 0.011 & 0.326 & 0.562 & -0.753 & 0.098 \\
0.112 & 0.071 & 0.259 & 0.455 & 0.347 & -0.767 \\
0.139 & 0.066 & 0.345 & 0.415 & 0.535 & 0.632 \\
0.768 & -0.563 & 0.218 & -0.186 & -0.100 & -0.022 \\
0.202 & 0.659 & 0.557 & -0.451 & -0.102 & -0.035 \\
-0.579 & -0.489 & 0.592 & -0.258 & 0.085 & -0.046
\end{pmatrix}.
$$

The first column of $\mathcal{G}$ is the first eigenvector and gives the
weights used in the linear combination of the original data in the first PC.

Example 11.3 To see how sensitive the PCs are to a change in the scale of the
variables, assume that $X_1$, $X_2$, $X_3$ and $X_6$ are measured in cm and
that $X_4$ and $X_5$ remain in mm in the bank data set. This leads to:

$$\bar{x} = (21.49, 13.01, 12.99, 9.41, 10.65, 14.05)^\top.$$

The covariance matrix can be obtained from $\mathcal{S}$ in (3.4) by dividing
rows 1, 2, 3, 6 and columns 1, 2, 3, 6 by 10. We obtain:

$$\ell = (2.101, 0.623, 0.005, 0.002, 0.001, 0.0004)^\top$$

which clearly differs from Example 11.2. Only the first two eigenvectors are
given:

$$g_1 = (-0.005, 0.011, 0.014, 0.992, 0.113, -0.052)^\top$$
$$g_2 = (-0.001, 0.013, 0.016, -0.117, 0.991, -0.069)^\top.$$

Comparing  these  results  to  the  first  two  columns  of  Q  from  Example  11.2,  a 
completely  different  story  is  revealed.  Here  the  first  component  is  dominated  by  X4 
(lower  margin)  and  the  second  by  X5  (upper  margin),  while  all  of  the  other  variables 
have  much  less  weight.  The  results  are  shown  in  Fig.  11.4.  Section  11.5  will  show 
how  to  select  a  reasonable  standardisation  of  the  variables  when  the  scales  are  too 
different. 
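
The effect described in Example 11.3 can be reproduced on simulated data. The Python/NumPy sketch below does not use the bank notes: the two-variable data set, the seed, the rescaling factor 100 and the helper `first_pc` are assumptions introduced for illustration. Changing the unit of one variable makes that variable dominate the first eigenvector.

```python
import numpy as np

rng = np.random.default_rng(4)           # arbitrary seed (assumption)
n = 200
x1 = rng.normal(scale=1.0, size=n)
x2 = 0.5 * x1 + rng.normal(scale=0.5, size=n)
X = np.column_stack([x1, x2])            # hypothetical two-variable data set

def first_pc(data):
    """Largest eigenvalue and its eigenvector of the empirical covariance (divisor n)."""
    S = np.cov(data, rowvar=False, bias=True)
    ell, G = np.linalg.eigh(S)           # ascending order, last entry is the largest
    return ell[-1], G[:, -1]

ell_orig, g_orig = first_pc(X)

# Express the second variable in a unit that is 100 times smaller
X_rescaled = X * np.array([1.0, 100.0])
ell_resc, g_resc = first_pc(X_rescaled)

print(np.round(g_orig, 3), np.round(g_resc, 3))
```

Before rescaling both variables carry noticeable weight in the first eigenvector; afterwards the weight of the rescaled variable is close to 1 in absolute value.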


Summary

- The scale of the variables should be roughly the same for PC
  transformations.

- For the practical implementation of PCA we replace $\mu$ by the mean
  $\bar{x}$ and $\Sigma$ by the empirical covariance $\mathcal{S}$. Then we
  compute the eigenvalues $\ell_1,\dots,\ell_p$ and the eigenvectors
  $g_1,\dots,g_p$ of $\mathcal{S}$. The graphical representation of the PCs is
  obtained by plotting the first PC vs. the second (and eventually vs. the
  third).

- The components of the eigenvectors $g_i$ are the weights of the original
  variables in the PCs.


Fig. 11.4 Principal components of the rescaled bank data (panel title: First vs. Second PC). Q MVApcabankr


11.3  Interpretation  of  the  PCs 

Recall  that  the  main  idea  of  PC  transformations  is  to  find  the  most  informative 
projections  that  maximise  variances.  The  most  informative  SLC  is  given  by  the 
first  eigenvector.  In  Sect.  11.2  the  eigenvectors  were  calculated  for  the  bank  data. 
In particular, with centred $x$'s, we had:

$$y_1 = -0.044x_1 + 0.112x_2 + 0.139x_3 + 0.768x_4 + 0.202x_5 - 0.579x_6$$
$$y_2 = 0.011x_1 + 0.071x_2 + 0.066x_3 - 0.563x_4 + 0.659x_5 - 0.489x_6$$

and

$x_1$ = length
$x_2$ = left height
$x_3$ = right height
$x_4$ = bottom frame
$x_5$ = top frame
$x_6$ = diagonal.

Hence,  the  first  PC  is  essentially  the  difference  between  the  bottom  frame 
variable  and  the  diagonal.  The  second  PC  is  best  described  by  the  difference  between 
the  top  frame  variable  and  the  sum  of  bottom  frame  and  diagonal  variables. 

The  weighting  of  the  PCs  tells  us  in  which  directions,  expressed  in  original 
coordinates,  the  best  variance  explanation  is  obtained.  A  measure  of  how  well  the 
first  q  PCs  explain  variation  is  given  by  the  relative  proportion: 

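$$\psi_q = \frac{\sum_{j=1}^{q} \lambda_j}{\sum_{j=1}^{p} \lambda_j} = \frac{\sum_{j=1}^{q} \operatorname{Var}(Y_j)}{\sum_{j=1}^{p} \operatorname{Var}(Y_j)}. \tag{11.12}$$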

Referring  to  the  bank  data  Example  11.2,  the  (cumulative)  proportions  of 
explained  variance  are  given  in  Table  11.1.  The  first  PC  (q  =  1)  already  explains 
67 % of the variation. The first three ($q = 3$) PCs explain 93 % of the variation.
Once  again  it  should  be  noted  that  PCs  are  not  scale  invariant,  e.g.  the  PCs  derived 
from  the  correlation  matrix  give  different  results  than  the  PCs  derived  from  the 
covariance  matrix  (see  Sect.  11.5). 

A  good  graphical  representation  of  the  ability  of  the  PCs  to  explain  the  variation 
in  the  data  is  given  by  the  scree  plot  shown  in  the  lower  right-hand  window  of 
Fig. 11.3. The scree plot can be modified by using the relative proportions on
the $y$-axis, as is shown in Fig. 11.5 for the bank data set.


Table 11.1 Proportion of variance of PCs

Eigenvalue   Proportion of variance   Cumulated proportion
2.985        0.67                     0.67
0.931        0.21                     0.88
0.242        0.05                     0.93
0.194        0.04                     0.97
0.085        0.02                     0.99
0.035        0.01                     1.00
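
The entries of Table 11.1 follow from (11.12) applied to the eigenvalues of Example 11.2. A minimal pure-Python sketch (variable names are assumptions) is:

```python
# Eigenvalues of S for the bank data, as printed in Example 11.2
eigenvalues = [2.985, 0.931, 0.242, 0.194, 0.085, 0.035]

total = sum(eigenvalues)
proportions = [ell / total for ell in eigenvalues]

# Cumulated proportions psi_1, ..., psi_p from Eq. (11.12)
cumulative = []
running = 0.0
for ell in eigenvalues:
    running += ell
    cumulative.append(running / total)

for ell, prop, cum in zip(eigenvalues, proportions, cumulative):
    print(f"{ell:6.3f}  {prop:5.2f}  {cum:5.2f}")
```

Rounded to two decimals, the second and third columns of the output agree with the table.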

Fig. 11.5 Relative proportion of variance explained by PCs. Q MVApcabanki
The covariance between the PC vector Y and the original vector X is calculated
with the help of (11.4) as follows:

$$\begin{aligned}
\operatorname{Cov}(X, Y) &= E(XY^\top) - E(X)\,E(Y)^\top = E(XY^\top) \\
&= E(XX^\top\Gamma) - \mu\mu^\top\Gamma = \operatorname{Var}(X)\,\Gamma \\
&= \Sigma\Gamma \qquad (11.13) \\
&= \Gamma\Lambda\Gamma^\top\Gamma \\
&= \Gamma\Lambda.
\end{aligned}$$


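Hence, the correlation, $\rho_{X_iY_j}$, between variable $X_i$ and the PC $Y_j$ is

$$\rho_{X_iY_j} = \frac{\gamma_{ij}\lambda_j}{(\sigma_{X_iX_i}\lambda_j)^{1/2}} = \gamma_{ij}\left(\frac{\lambda_j}{\sigma_{X_iX_i}}\right)^{1/2}. \qquad (11.14)$$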


Using actual data, this of course translates into

$$r_{X_iY_j} = g_{ij}\left(\frac{\ell_j}{s_{X_iX_i}}\right)^{1/2}. \qquad (11.15)$$


The correlations can be used to evaluate the relations between the PCs $Y_j$, where
$j = 1,\dots,q$, and the original variables $X_i$, where $i = 1,\dots,p$. Note that

$$\sum_{j=1}^{p} r^2_{X_iY_j} = \frac{\sum_{j=1}^{p} \ell_j g_{ij}^2}{s_{X_iX_i}} = \frac{s_{X_iX_i}}{s_{X_iX_i}} = 1. \qquad (11.16)$$
Fig. 11.6 The correlation of the original variable with the PCs (horizontal axis: First PC) Q MVApcabanki


Indeed, $\sum_{j=1}^{p} \ell_j g_{ij}^2 = g_i^\top \mathcal{L} g_i$ is the $(i,i)$-element of the matrix $\mathcal{G}\mathcal{L}\mathcal{G}^\top = \mathcal{S}$, so
that $r^2_{X_iY_j}$ may be seen as the proportion of variance of $X_i$ explained by $Y_j$.

In the space of the first two PCs we plot these correlations, i.e. $r_{X_iY_1}$ versus $r_{X_iY_2}$.
Figure 11.6 shows this for the bank notes example. This plot shows which of the
original variables are most strongly correlated with the PCs $Y_1$ and $Y_2$.

From (11.16) it follows that $r^2_{X_iY_1} + r^2_{X_iY_2} \le 1$, so that the points are
always inside the circle of radius 1. In the bank notes example, the variables X4, X5
and X6 correspond to correlations near the periphery of the circle and are thus well
explained by the first two PCs. Recall that we have interpreted the first PC as being
essentially the difference between X4 and X6. This is also reflected in Fig. 11.6 since
the points corresponding to these variables lie on different sides of the vertical axis.
An analogous remark applies to the second PC. We had seen that the second PC is
well described by the difference between X5 and the sum of X4 and X6. Now we
are able to see this result again from Fig. 11.6 since the point corresponding to X5
lies above the horizontal axis and the points corresponding to X4 and X6 lie below.
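
The identity (11.16) and the resulting unit-circle bound can be checked numerically. The sketch below assumes a hypothetical 2 x 2 sample covariance matrix (the entries are illustrative and are not taken from the bank data) and uses the closed-form eigendecomposition of a symmetric 2 x 2 matrix.

```python
import math

# Hypothetical 2x2 sample covariance matrix S (illustrative values only).
s11, s12, s22 = 4.0, 1.5, 1.0

# Closed-form eigenvalues of the symmetric matrix [[s11, s12], [s12, s22]].
centre = (s11 + s22) / 2.0
radius = math.sqrt(((s11 - s22) / 2.0) ** 2 + s12 ** 2)
l1, l2 = centre + radius, centre - radius

def unit_eigenvector(l):
    # (S - l I) g = 0 implies g is proportional to (s12, l - s11) when s12 != 0.
    v = (s12, l - s11)
    norm = math.hypot(v[0], v[1])
    return (v[0] / norm, v[1] / norm)

g1, g2 = unit_eigenvector(l1), unit_eigenvector(l2)

# r_{X_i Y_j} = g_ij * sqrt(l_j / s_ii), cf. (11.15).
variances = (s11, s22)
r = [[g[i] * math.sqrt(l / variances[i]) for g, l in ((g1, l1), (g2, l2))]
     for i in range(2)]

for i in range(2):
    total = r[i][0] ** 2 + r[i][1] ** 2
    print(f"X{i + 1}: r_Y1 = {r[i][0]: .3f}, r_Y2 = {r[i][1]: .3f}, sum of squares = {total:.3f}")
```

With only two variables both PCs are used, so each sum of squares equals one and every point lies exactly on the circle; with more variables than plotted PCs the points fall inside it.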

The correlations of the original variables $X_i$ and the first two PCs are given
in Table 11.2 along with the cumulated percentage of variance of each variable
explained by $Y_1$ and $Y_2$. This table confirms the above results. In particular, it
confirms that the percentage of variance of X1 (and X2, X3) explained by the first
two PCs is relatively small and so are their weights in the graphical representation
of the individual bank notes in the space of the first two PCs (as can be seen in
the upper left plot in Fig. 11.3). Looking simultaneously at Fig. 11.6 and the upper
left plot of Fig. 11.3 shows that the genuine bank notes are roughly characterised by
large values of X6 and smaller values of X4. The counterfeit bank notes show larger
values of X5 (see Example 7.15).
Table 11.2 Correlation between the original variables and the PCs

Variable       r_{XiY1}   r_{XiY2}   r^2_{XiY1} + r^2_{XiY2}
X1 length      -0.201      0.028     0.041
X2 left h.      0.538      0.191     0.326
X3 right h.     0.597      0.159     0.381
X4 lower        0.921     -0.377     0.991
X5 upper        0.435      0.794     0.820
X6 diagonal    -0.870     -0.410     0.926

Summary

The weighting of the PCs tells us in which directions, expressed
in original coordinates, the best explanation of the variance is
obtained. Note that the PCs are not scale invariant.

A measure of how well the first q PCs explain variation is given
by the relative proportion $\psi_q = \sum_{j=1}^{q} \lambda_j / \sum_{j=1}^{p} \lambda_j$. A good
graphical representation of the ability of the PCs to explain the
variation in the data is the scree plot of these proportions.

The correlation between PC $Y_j$ and an original variable $X_i$ is
$\rho_{X_iY_j} = \gamma_{ij}\left(\lambda_j / \sigma_{X_iX_i}\right)^{1/2}$. For a data matrix this translates into
$r^2_{X_iY_j} = \ell_j g_{ij}^2 / s_{X_iX_i}$. $r^2_{X_iY_j}$ can be interpreted as the proportion of variance
of $X_i$ explained by $Y_j$. A plot of $r_{X_iY_1}$ vs. $r_{X_iY_2}$ shows which of
the original variables are most strongly correlated with the PCs,
namely those that are close to the periphery of the circle of radius 1.


11.4  Asymptotic  Properties  of  the  PCs 


In  practice,  PCs  are  computed  from  sample  data.  The  following  theorem  yields 
results  on  the  asymptotic  distribution  of  the  sample  PCs. 

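Theorem 11.4 Let $\Sigma > 0$ with distinct eigenvalues, and let $\mathcal{U} \sim m^{-1} W_p(\Sigma, m)$
with spectral decompositions $\Sigma = \Gamma\Lambda\Gamma^\top$, and $\mathcal{U} = \mathcal{G}\mathcal{L}\mathcal{G}^\top$. Then

(a) $\sqrt{m}(\ell - \lambda) \xrightarrow{\mathcal{L}} N_p(0, 2\Lambda^2)$,
where $\ell = (\ell_1,\dots,\ell_p)^\top$ and $\lambda = (\lambda_1,\dots,\lambda_p)^\top$ are the diagonals of $\mathcal{L}$ and $\Lambda$,

(b) $\sqrt{m}(g_j - \gamma_j) \xrightarrow{\mathcal{L}} N_p(0, \mathcal{V}_j)$,
with $\mathcal{V}_j = \lambda_j \sum_{k \ne j} \dfrac{\lambda_k}{(\lambda_k - \lambda_j)^2}\, \gamma_k\gamma_k^\top$,

(c) $\operatorname{Cov}(g_j, g_k) = \mathcal{V}_{jk}$,
where the $(r,s)$-element of the matrix $\mathcal{V}_{jk}$ $(p \times p)$ is $-\dfrac{\lambda_j\lambda_k\gamma_{rk}\gamma_{sj}}{m(\lambda_j - \lambda_k)^2}$,

(d) the elements in $\ell$ are asymptotically independent of the elements in $\mathcal{G}$.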


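Example 11.4 Since $n\mathcal{S} \sim W_p(\Sigma, n-1)$ if $X_1,\dots,X_n$ are drawn from $N(\mu, \Sigma)$,
we have that

$$\sqrt{n-1}\,(\ell_j - \lambda_j) \xrightarrow{\mathcal{L}} N(0, 2\lambda_j^2), \quad j = 1,\dots,p. \qquad (11.17)$$

Since the variance in (11.17) depends on the unknown true value $\lambda_j$, a log transformation
is useful. Consider $f(\ell_j) = \log(\ell_j)$. Then $\frac{d}{d\ell_j} f \big|_{\ell_j = \lambda_j} = \frac{1}{\lambda_j}$ and by the
Transformation Theorem 4.11 we have from (11.17) that

$$\sqrt{n-1}\,(\log \ell_j - \log \lambda_j) \xrightarrow{\mathcal{L}} N(0, 2). \qquad (11.18)$$

Hence,

$$\sqrt{\frac{n-1}{2}}\,(\log \ell_j - \log \lambda_j) \xrightarrow{\mathcal{L}} N(0, 1)$$

and a two-sided confidence interval at the $1 - \alpha = 0.95$ confidence level is given
by

$$\log(\ell_j) - 1.96\sqrt{\frac{2}{n-1}} \le \log \lambda_j \le \log(\ell_j) + 1.96\sqrt{\frac{2}{n-1}}.$$

In the bank data example we have that

$$\ell_1 = 2.98.$$

Therefore,

$$\log(2.98) \pm 1.96\sqrt{\frac{2}{199}} = \log(2.98) \pm 0.1965.$$

It can be concluded for the true eigenvalue that

$$P\{\lambda_1 \in (2.448, 3.62)\} \approx 0.95.$$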
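
The interval of Example 11.4 can be reproduced in a few lines. This is a minimal sketch based on (11.18); the function name is illustrative, and n = 200 is an assumption inferred from the half-width 0.1965, which corresponds to n - 1 = 199.

```python
import math

def eigenvalue_confidence_interval(l_j, n, z=1.96):
    """Approximate interval for lambda_j from (11.18); z = 1.96 gives 95 % coverage."""
    half_width = z * math.sqrt(2.0 / (n - 1))
    centre = math.log(l_j)
    return math.exp(centre - half_width), math.exp(centre + half_width)

# First eigenvalue of the bank data; n = 200 is assumed (see the lead-in).
lower, upper = eigenvalue_confidence_interval(2.98, n=200)
print(f"approximate 95 % interval for lambda_1: ({lower:.3f}, {upper:.3f})")
```

Because the interval is built on the log scale and then transformed back, it is not symmetric around the estimate 2.98.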
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Variance  Explained  by  the  First  q  PCs 


The variance explained by the first q PCs is given by

$$\psi = \frac{\lambda_1 + \dots + \lambda_q}{\sum_{j=1}^{p} \lambda_j}.$$

In practice this is estimated by

$$\hat\psi = \frac{\ell_1 + \dots + \ell_q}{\sum_{j=1}^{p} \ell_j}.$$

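From Theorem 11.4 we know the distribution of $\sqrt{n-1}\,(\ell - \lambda)$. Since $\psi$ is a nonlinear
function of $\lambda$, we can again apply the Transformation Theorem 4.11 to obtain
that

$$\sqrt{n-1}\,(\hat\psi - \psi) \xrightarrow{\mathcal{L}} N(0, \mathcal{D}^\top\mathcal{V}\mathcal{D})$$

where $\mathcal{V} = 2\Lambda^2$ (from Theorem 11.4) and $\mathcal{D} = (d_1,\dots,d_p)^\top$ with

$$d_j = \frac{\partial\psi}{\partial\lambda_j} = \begin{cases} \dfrac{1-\psi}{\operatorname{tr}(\Sigma)} & \text{for } 1 \le j \le q, \\[2ex] \dfrac{-\psi}{\operatorname{tr}(\Sigma)} & \text{for } q+1 \le j \le p. \end{cases}$$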


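Given this result, the following theorem can be derived.

Theorem 11.5

$$\sqrt{n-1}\,(\hat\psi - \psi) \xrightarrow{\mathcal{L}} N(0, \omega^2),$$

where

$$\omega^2 = \mathcal{D}^\top\mathcal{V}\mathcal{D} = \frac{2}{\{\operatorname{tr}(\Sigma)\}^2}\left\{(1-\psi)^2(\lambda_1^2 + \dots + \lambda_q^2) + \psi^2(\lambda_{q+1}^2 + \dots + \lambda_p^2)\right\} = \frac{2\operatorname{tr}(\Sigma^2)}{\{\operatorname{tr}(\Sigma)\}^2}\,(\psi^2 - 2\beta\psi + \beta)$$

and

$$\beta = \frac{\lambda_1^2 + \dots + \lambda_q^2}{\lambda_1^2 + \dots + \lambda_p^2}.$$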
Example 11.5 From Sect. 11.3 it is known that the first PC for the Swiss bank notes
resolves 67 % of the variation. It can be tested whether the true proportion is actually
75 %. Computing

$$\hat\beta = \frac{\ell_1^2}{\ell_1^2 + \dots + \ell_p^2} = \frac{(2.985)^2}{(2.985)^2 + (0.931)^2 + \dots + (0.035)^2} = 0.902,$$

$$\operatorname{tr}(\mathcal{S}) = 4.472,$$

$$\operatorname{tr}(\mathcal{S}^2) = \sum_{j=1}^{p} \ell_j^2 = 9.883,$$

$$\hat\omega^2 = \frac{2\operatorname{tr}(\mathcal{S}^2)}{\{\operatorname{tr}(\mathcal{S})\}^2}\,(\hat\psi^2 - 2\hat\beta\hat\psi + \hat\beta) = \frac{2 \cdot 9.883}{(4.472)^2}\left\{(0.668)^2 - 2(0.902)(0.668) + 0.902\right\} = 0.142.$$

Hence, a confidence interval at the confidence level $1 - \alpha = 0.95$ is given by

$$0.668 \pm 1.96\sqrt{\frac{0.142}{199}} = (0.615, 0.720).$$

Clearly the hypothesis that $\psi = 75\,\%$ can be rejected!
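
The quantities in Example 11.5 follow directly from the eigenvalues in Table 11.1. The sketch below recomputes them using the formula of Theorem 11.5 with estimates plugged in; the variable names are illustrative, and n = 200 is an assumption consistent with the divisor 199 used in the example.

```python
import math

eigenvalues = [2.985, 0.931, 0.242, 0.194, 0.085, 0.035]  # Table 11.1
n, q = 200, 1  # n = 200 is assumed; q = 1 concerns the first PC only

tr_s = sum(eigenvalues)                   # tr(S)
tr_s2 = sum(l ** 2 for l in eigenvalues)  # tr(S^2)
psi_hat = sum(eigenvalues[:q]) / tr_s
beta_hat = sum(l ** 2 for l in eigenvalues[:q]) / tr_s2

# Plug-in estimate of omega^2 from Theorem 11.5.
omega2_hat = 2.0 * tr_s2 / tr_s ** 2 * (psi_hat ** 2 - 2.0 * beta_hat * psi_hat + beta_hat)

half_width = 1.96 * math.sqrt(omega2_hat / (n - 1))
ci = (psi_hat - half_width, psi_hat + half_width)

print(f"psi_hat = {psi_hat:.3f}, beta_hat = {beta_hat:.3f}, omega2_hat = {omega2_hat:.3f}")
print(f"approximate 95 % interval: ({ci[0]:.3f}, {ci[1]:.3f})")
print("hypothesised value 0.75 inside interval:", ci[0] <= 0.75 <= ci[1])
```

Changing `q` gives the corresponding interval for the proportion explained by the first q PCs.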


Summary

The eigenvalues $\ell_j$ and eigenvectors $g_j$ are asymptotically normally
distributed, in particular $\sqrt{n-1}\,(\ell - \lambda) \xrightarrow{\mathcal{L}} N_p(0, 2\Lambda^2)$.

For the eigenvalues it holds that $\sqrt{\frac{n-1}{2}}\,(\log \ell_j - \log \lambda_j) \xrightarrow{\mathcal{L}} N(0, 1)$.

Given an asymptotic normal distribution, approximate confidence
intervals and tests can be constructed for the proportion of variance
which is explained by the first q PCs. For the eigenvalues, the two-sided confidence
interval at the $1 - \alpha = 0.95$ level is given by
$\log(\ell_j) - 1.96\sqrt{\frac{2}{n-1}} \le \log \lambda_j \le \log(\ell_j) + 1.96\sqrt{\frac{2}{n-1}}$.

It holds for $\hat\psi$, the estimate of $\psi$ (the proportion of the variance
explained by the first q PCs), that $\sqrt{n-1}\,(\hat\psi - \psi) \xrightarrow{\mathcal{L}} N(0, \omega^2)$,
where $\omega^2$ is given in Theorem 11.5.


11.5  Normalised  Principal  Components  Analysis 

In certain situations the original variables can be heterogeneous w.r.t. their variances.
This is particularly true when the variables are measured on heterogeneous
scales (such as years, kilograms, dollars, ...). In this case a description of the
information contained in the data needs to be provided which is robust w.r.t. the
choice of scale. This can be achieved through a standardisation of the variables,
namely

$$\mathcal{X}_S = \mathcal{H}\mathcal{X}\mathcal{D}^{-1/2} \qquad (11.19)$$

where $\mathcal{D} = \operatorname{diag}(s_{X_1X_1},\dots,s_{X_pX_p})$. Note that $\bar{x}_S = 0$ and $\mathcal{S}_{X_S} = \mathcal{R}$, the correlation
matrix of $\mathcal{X}$. The PC transformations of the matrix $\mathcal{X}_S$ are referred to as the
Normalised Principal Components (NPCs). The spectral decomposition of $\mathcal{R}$ is

$$\mathcal{R} = \mathcal{G}_R\mathcal{L}_R\mathcal{G}_R^\top, \qquad (11.20)$$

where $\mathcal{L}_R = \operatorname{diag}(\ell_1^R,\dots,\ell_p^R)$ and $\ell_1^R \ge \dots \ge \ell_p^R$ are the eigenvalues of $\mathcal{R}$ with
corresponding eigenvectors $g_1^R,\dots,g_p^R$ (note that here $\sum_{j=1}^{p} \ell_j^R = \operatorname{tr}(\mathcal{R}) = p$).

The NPCs, $Z_j$, provide a representation of each individual, and are given by

$$\mathcal{Z} = \mathcal{X}_S\mathcal{G}_R = (z_1,\dots,z_p). \qquad (11.21)$$

After transforming the variables, once again, we have that

$$\bar{z} = 0, \qquad (11.22)$$

$$\mathcal{S}_Z = \mathcal{G}_R^\top\mathcal{S}_{X_S}\mathcal{G}_R = \mathcal{G}_R^\top\mathcal{R}\,\mathcal{G}_R = \mathcal{L}_R. \qquad (11.23)$$

The NPCs provide a perspective similar to that of the PCs, but in terms of
the relative position of individuals, the NPCs give each variable the same weight (with
the PCs the variable with the largest variance receives the largest weight).
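
The effect of standardisation can be seen in a small two-variable sketch. The data below are hypothetical (one variable in years, one in dollars), and the 2 x 2 closed-form eigenvalues stand in for a general eigendecomposition; the covariance uses the divisor n, which is assumed to match the empirical covariance convention of the text.

```python
import math

# Hypothetical data: X1 in years, X2 in dollars (very different scales).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1200.0, 900.0, 2100.0, 1800.0, 3000.0]

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    # Empirical covariance with divisor n.
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

def eigenvalues_2x2(a, b, d):
    # Eigenvalues of the symmetric matrix [[a, b], [b, d]], largest first.
    centre = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    return centre + radius, centre - radius

s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
r12 = s12 / math.sqrt(s11 * s22)

pc_eigs = eigenvalues_2x2(s11, s12, s22)   # PCA on the covariance matrix
npc_eigs = eigenvalues_2x2(1.0, r12, 1.0)  # NPCA on the correlation matrix

print("share of first PC :", pc_eigs[0] / sum(pc_eigs))
print("share of first NPC:", npc_eigs[0] / sum(npc_eigs))
print("sum of NPC eigenvalues (equals p = 2):", sum(npc_eigs))
```

On the raw scale the first PC is driven almost entirely by the dollar variable, whereas after standardisation both variables enter with equal weight and the eigenvalues sum to p.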




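Computing the covariance and correlation between $X_i$ and $Z_j$ is straightforward:

$$\mathcal{S}_{X_S,Z} = \frac{1}{n}\mathcal{X}_S^\top\mathcal{Z} = \mathcal{G}_R\mathcal{L}_R, \qquad (11.24)$$

$$\mathcal{R}_{X_S,Z} = \mathcal{G}_R\mathcal{L}_R\mathcal{L}_R^{-1/2} = \mathcal{G}_R\mathcal{L}_R^{1/2}. \qquad (11.25)$$

The correlations between the original variables $X_i$ and the NPCs $Z_j$ are:

$$r_{X_iZ_j} = \sqrt{\ell_j}\, g_{R,ij}, \qquad (11.26)$$

$$\sum_{j=1}^{p} r^2_{X_iZ_j} = 1 \qquad (11.27)$$

(compare this to (11.15) and (11.16)). The resulting NPCs, the $Z_j$, can be
interpreted in terms of the original variables and the role of each PC in explaining
the variation in variable $X_i$ can be evaluated.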

11.6  Principal  Components  as  a  Factorial  Method 

The  empirical  PCs  (normalised  or  not)  turn  out  to  be  equivalent  to  the  factors  that 
one  would  obtain  by  decomposing  the  appropriate  data  matrix  into  its  factors  (see 
Chap.  10).  It  will  be  shown  that  the  PCs  are  the  factors  representing  the  rows 
of  the  centred  data  matrix  and  that  the  NPCs  correspond  to  the  factors  of  the 
standardised  data  matrix.  The  representation  of  the  columns  of  the  standardised 
data matrix provides (up to a scale factor) the correlations between the NPCs and the
original  variables.  The  derivation  of  the  (N)PCs  presented  above  will  have  a  nice 
geometric  justification  here  since  they  are  the  best  fit  in  subspaces  generated  by  the 
columns  of  the  (transformed)  data  matrix  X.  This  analogy  provides  complementary 
interpretations  of  the  graphical  representations  shown  above. 

Assume, as in Chap. 10, that we want to obtain representations of the individuals
(the rows of $\mathcal{X}$) and of the variables (the columns of $\mathcal{X}$) in spaces of smaller
dimension. To keep the representations simple, some prior transformations are
performed. Since the origin has no particular statistical meaning in the space of
individuals, we will first shift the origin to the centre of gravity, $\bar{x}$, of the point
cloud. This is the same as analysing the centred data matrix $\mathcal{X}_C = \mathcal{H}\mathcal{X}$. Now all of
the variables have zero means, thus the technique used in Chap. 10 can be applied
to the matrix $\mathcal{X}_C$. Note that the spectral decomposition of $\mathcal{X}_C^\top\mathcal{X}_C$ is related to that
of $\mathcal{S}_X$, namely

$$\mathcal{X}_C^\top\mathcal{X}_C = \mathcal{X}^\top\mathcal{H}^\top\mathcal{H}\mathcal{X} = n\mathcal{S}_X = n\mathcal{G}\mathcal{L}\mathcal{G}^\top. \qquad (11.28)$$
The factorial variables are obtained by projecting $\mathcal{X}_C$ on $\mathcal{G}$,

$$\mathcal{Y} = \mathcal{X}_C\mathcal{G} = (y_1,\dots,y_p). \qquad (11.29)$$

These are the same principal components obtained above, see formula (11.10).
(Note that the y's here correspond to the z's in Sect. 10.2.) Since $\mathcal{H}\mathcal{X}_C = \mathcal{X}_C$, it
immediately follows that

$$\bar{y} = 0, \qquad (11.30)$$

$$\mathcal{S}_Y = \mathcal{G}^\top\mathcal{S}_X\mathcal{G} = \mathcal{L} = \operatorname{diag}(\ell_1,\dots,\ell_p). \qquad (11.31)$$

The scatterplot of the individuals on the factorial axes is thus centred around the
origin and is more spread out in the first direction (first PC has variance $\ell_1$) than
in the second direction (second PC has variance $\ell_2$).

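The representation of the variables can be obtained using the Duality Relations
(10.11) and (10.12). The projections of the columns of $\mathcal{X}_C$ onto the eigenvectors
$v_k$ of $\mathcal{X}_C\mathcal{X}_C^\top$ are

$$\mathcal{X}_C^\top v_k = \frac{1}{\sqrt{n\ell_k}}\,\mathcal{X}_C^\top\mathcal{X}_C\, g_k = \sqrt{n\ell_k}\, g_k. \qquad (11.32)$$

Thus the projections of the variables on the first p axes are the columns of the matrix

$$\mathcal{X}_C^\top\mathcal{V} = \sqrt{n}\,\mathcal{G}\mathcal{L}^{1/2}. \qquad (11.33)$$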

Considering the geometric representation, there is a nice statistical interpretation of
the angle between two columns of $\mathcal{X}_C$. Given that

$$x_{C[j]}^\top x_{C[k]} = n s_{X_jX_k}, \qquad (11.34)$$

$$\|x_{C[j]}\|^2 = n s_{X_jX_j}, \qquad (11.35)$$

where $x_{C[j]}$ and $x_{C[k]}$ denote the j-th and k-th column of $\mathcal{X}_C$, it holds that in the
full space of the variables, if $\theta_{jk}$ is the angle between two variables, $x_{C[j]}$ and $x_{C[k]}$,
then

$$\cos\theta_{jk} = \frac{x_{C[j]}^\top x_{C[k]}}{\|x_{C[j]}\|\,\|x_{C[k]}\|} = r_{X_jX_k}. \qquad (11.36)$$

(Example 2.11 shows the general connection that exists between the angle and
correlation of two variables.) As a result, the relative positions of the variables in
the scatterplot of the first columns of $\mathcal{X}_C^\top\mathcal{V}$ may be interpreted in terms of their
correlations; the plot provides a picture of the correlation structure of the original
data set. Clearly, one should take into account the percentage of variance explained
by the chosen axes when evaluating the correlation.
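
Relation (11.36) is easy to verify numerically. The sketch below uses two short hypothetical data columns (not from any data set in the book), centres them, and compares the cosine of the angle between the centred columns with the empirical correlation computed from covariances with divisor n.

```python
import math

x_j = [2.0, 4.0, 4.0, 7.0, 8.0]  # hypothetical variable j
x_k = [1.0, 3.0, 2.0, 5.0, 9.0]  # hypothetical variable k
n = len(x_j)

def centre(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

c_j, c_k = centre(x_j), centre(x_k)

# Cosine of the angle between the centred columns, cf. (11.36).
cos_theta = dot(c_j, c_k) / math.sqrt(dot(c_j, c_j) * dot(c_k, c_k))

# Empirical correlation from (co)variances, cf. (11.34) and (11.35).
s_jk, s_jj, s_kk = dot(c_j, c_k) / n, dot(c_j, c_j) / n, dot(c_k, c_k) / n
r_jk = s_jk / math.sqrt(s_jj * s_kk)

print(f"cos(theta) = {cos_theta:.6f}, correlation = {r_jk:.6f}")
```

The factor n cancels in the ratio, which is why the angle depends only on the correlation and not on the sample size.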




The NPCs can also be viewed as a factorial method for reducing the dimension.
The variables are again standardised so that each one has mean zero and unit
variance and is independent of the scale of measurement. The factorial analysis of
$\mathcal{X}_S$ provides the NPCs. The spectral decomposition of $\mathcal{X}_S^\top\mathcal{X}_S$ is related to that of
$\mathcal{R}$, namely

$$\mathcal{X}_S^\top\mathcal{X}_S = \mathcal{D}^{-1/2}\mathcal{X}^\top\mathcal{H}\mathcal{X}\mathcal{D}^{-1/2} = n\mathcal{R} = n\mathcal{G}_R\mathcal{L}_R\mathcal{G}_R^\top.$$

The NPCs $Z_j$, given by (11.21), may be viewed as the projections of the rows of
$\mathcal{X}_S$ onto $\mathcal{G}_R$.

The representation of the variables is again given by the columns of

$$\mathcal{X}_S^\top\mathcal{V}_R = \sqrt{n}\,\mathcal{G}_R\mathcal{L}_R^{1/2}. \qquad (11.37)$$

Comparing (11.37) and (11.25) we see that the projections of the variables in the
factorial analysis provide the correlation between the NPCs $Z_k$ and the original
variables $x_{[j]}$ (up to the factor $\sqrt{n}$ which could be the scale of the axes).

This implies that a deeper interpretation of the representation of the individuals
can be obtained by looking simultaneously at the graphs plotting the variables. Note
that

$$x_{S[j]}^\top x_{S[k]} = n r_{X_jX_k}, \qquad (11.38)$$

$$\|x_{S[j]}\|^2 = n, \qquad (11.39)$$

where $x_{S[j]}$ and $x_{S[k]}$ denote the j-th and k-th column of $\mathcal{X}_S$. Hence, in the full
space, all the standardised variables (columns of $\mathcal{X}_S$) are contained within the
"sphere" in $\mathbb{R}^n$, which is centred at the origin and has radius $\sqrt{n}$ (the scale of the
graph). As in (11.36), given the angle $\theta_{jk}$ between two columns $x_{S[j]}$ and $x_{S[k]}$, it
holds that

$$\cos\theta_{jk} = r_{X_jX_k}. \qquad (11.40)$$

Therefore, when looking at the representation of the variables in the spaces of
reduced dimension (for instance the first two factors), we have a picture of the
correlation structure between the original $X_i$'s in terms of their angles. Of course,
the quality of the representation in those subspaces has to be taken into account,
which is presented in the next section.
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Quality  of  the  Representations 

As said before, an overall measure of the quality of the representation is given by

$$\psi = \frac{\ell_1 + \dots + \ell_q}{\sum_{j=1}^{p} \ell_j}.$$

In practice, q is chosen to be equal to 1, 2 or 3. Suppose for instance that $\psi = 0.93$
for q = 2. This means that the graphical representation in two dimensions captures
93 % of the total variance. In other words, there is minimal dispersion in a third
direction (no more than 7 %).

It can be useful to check if each individual is well represented by the PCs. Clearly,
the proximity of two individuals on the projected space may not necessarily coincide
with the proximity in the full original space $\mathbb{R}^p$, which may lead to erroneous
interpretations of the graphs. In this respect, it is worth computing the angle
between the representation of an individual i and the k-th PC or NPC axis. This can
be done using (2.40), i.e.

$$\cos\vartheta_{ik} = \frac{y_i^\top e_k}{\|y_i\|\,\|e_k\|} = \frac{y_{ik}}{\|y_i\|}$$

for the PCs or analogously

$$\cos\zeta_{ik} = \frac{z_i^\top e_k}{\|z_i\|\,\|e_k\|} = \frac{z_{ik}}{\|z_i\|}$$

for the NPCs, where $e_k$ denotes the k-th unit vector $e_k = (0,\dots,1,\dots,0)^\top$. An
individual i will be well represented on the k-th PC axis if its corresponding angle is
small, i.e. if $\cos^2\vartheta_{ik}$ for $k = 1,\dots,p$ is close to one. Note that for each individual i,

$$\sum_{k=1}^{p} \cos^2\vartheta_{ik} = \frac{y_i^\top y_i}{y_i^\top y_i} = \frac{x_{Ci}^\top\mathcal{G}\mathcal{G}^\top x_{Ci}}{x_{Ci}^\top x_{Ci}} = 1.$$

The values $\cos^2\vartheta_{ik}$ are sometimes called the relative contributions of the k-th axis
to the representation of the i-th individual, e.g. if $\cos^2\vartheta_{i1} + \cos^2\vartheta_{i2}$ is large (near
one), we know that the individual i is well represented on the plane of the first two
principal axes since its corresponding angle with the plane is close to zero.

We already know that the quality of the representation of the variables can be
evaluated by the percentage of $X_i$'s variance that is explained by a PC, which is
given by $r^2_{X_iY_j}$ or $r^2_{X_iZ_j}$ according to (11.16) and (11.27) respectively.
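
The relative contributions of the axes to the representation of one individual can be computed directly from its PC scores. The following sketch assumes a hypothetical score vector for a single individual with p = 4; the values are illustrative only.

```python
# Hypothetical PC scores y_i = (y_i1, ..., y_ip) of one individual, p = 4.
y_i = [2.4, -0.7, 0.3, 0.1]

squared_norm = sum(v * v for v in y_i)

# cos^2 of the angle between the individual and each PC axis.
cos2 = [v * v / squared_norm for v in y_i]

# Quality of the representation on the plane of the first two axes.
quality_plane = cos2[0] + cos2[1]

for k, c in enumerate(cos2, start=1):
    print(f"axis {k}: relative contribution = {c:.3f}")
print(f"first two axes together: {quality_plane:.3f}")
```

A value of `quality_plane` close to one indicates that little of this individual's distance from the origin lies outside the plane of the first two axes.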
Fig. 11.7 Representation of the individuals (horizontal axis: First Factor - Families) Q MVAnpcafood


Example  11.6  Let  us  return  to  the  French  food  expenditure  example,  see  Sect.  22.6. 
This  yields  a  two-dimensional  representation  of  the  individuals  as  shown  in 
Fig.  11.7. 

Calculating the matrix $\mathcal{G}_R$ we have

$$\mathcal{G}_R = \begin{pmatrix}
-0.240 & 0.622 & -0.011 & -0.544 & 0.036 & 0.508 \\
-0.466 & 0.098 & -0.062 & -0.023 & -0.809 & -0.301 \\
-0.446 & -0.205 & 0.145 & 0.548 & -0.067 & 0.625 \\
-0.462 & -0.141 & 0.207 & -0.053 & 0.411 & -0.093 \\
-0.438 & -0.197 & 0.356 & -0.324 & 0.224 & -0.350 \\
-0.281 & 0.523 & -0.444 & 0.450 & 0.341 & -0.332 \\
0.206 & 0.479 & 0.780 & 0.306 & -0.069 & -0.138
\end{pmatrix}$$

which gives the weights of the variables (milk, vegetables, etc.). The eigenvalues $\ell_j$
and the proportions of explained variance are given in Table 11.3.

The interpretation of the principal components is best understood when looking
at the correlations between the original $X_i$'s and the PCs. Since the first two PCs
explain 88.1 % of the variance, we limit ourselves to the first two PCs. The results
are shown in Table 11.4. The two-dimensional graphical representation of the
variables in Fig. 11.8 is based on the first two columns of Table 11.4.

The plots are the projections of the variables into $\mathbb{R}^2$. Since the quality of the
representation is good for all the variables (except maybe X7), their relative angles
give a picture of their original correlation: wine is negatively correlated with the
Table 11.3 Eigenvalues and explained variance

Eigenvalue   Proportion of variance   Cumulated proportion (%)
4.333        0.6190                    61.9
1.830        0.2620                    88.1
0.631        0.0900                    97.1
0.128        0.0180                    98.9
0.058        0.0080                    99.7
0.019        0.0030                    99.9
0.001        0.0001                   100.0

Table 11.4 Correlations with PCs

Variable         r_{XiZ1}   r_{XiZ2}   r^2_{XiZ1} + r^2_{XiZ2}
X1: bread        -0.499      0.842     0.957
X2: vegetables   -0.970      0.133     0.958
X3: fruits       -0.929     -0.278     0.941
X4: meat         -0.962     -0.191     0.962
X5: poultry      -0.911     -0.266     0.901
X6: milk         -0.584      0.707     0.841
X7: wine          0.428      0.648     0.604

Fig. 11.8 Representation of the variables (horizontal axis: First Factor - Goods, from -1 to 1) Q MVAnpcafood


vegetables, fruits, meat and poultry groups ($\theta > 90°$), whereas taken individually
the variables in this latter group are highly positively correlated with each other ($\theta \approx 0$).
Bread and milk are positively correlated but poorly correlated with meat, fruits
and poultry ($\theta \approx 90°$).

Now the representation of the individuals in Fig. 11.7 can be interpreted
better. From Fig. 11.8 and Table 11.4 we can see that the first factor $Z_1$ is a
vegetable-meat-poultry-fruit factor (with a negative sign), whereas the second
factor is a milk-bread-wine factor (with a positive sign). Note that this corresponds
to the most important weights in the first columns of $\mathcal{G}_R$. In Fig. 11.7 lines were
drawn to connect families of the same size and families of the same professional
types. A grid can clearly be seen (with a slight deformation by the manager families)
that shows the families with higher expenditures (higher number of children) on
the left.

Considering both figures together explains what types of expenditures are responsible
for similarities in food expenditures. Bread, milk and wine expenditures are
similar for manual workers and employees. Families of managers are characterised
by higher expenditures on vegetables, fruits, meat and poultry. Very often when
analysing NPCs (and PCs), it is illuminating to use such a device to introduce
qualitative aspects of individuals in order to enrich the interpretations of the graphs.


Summary

NPCs are PCs applied to the standardised (normalised) data matrix
$\mathcal{X}_S$.

The graphical representation of NPCs provides a similar type of
picture as that of PCs, the difference being in the relative position
of individuals, i.e. each variable in NPCs has the same weight (in
PCs, the variable with the largest variance has the largest weight).

The quality of the representation is evaluated by
$\psi = \left(\sum_{j=1}^{p} \ell_j\right)^{-1}(\ell_1 + \ell_2 + \dots + \ell_q)$.

The quality of the representation of a variable can be evaluated by
the percentage of $X_i$'s variance that is explained by a PC, i.e. $r^2_{X_iY_j}$.


11.7  Common  Principal  Components 

In many applications a statistical analysis is simultaneously done for groups of data.
In this section a technique is presented that allows us to analyse group elements that
have common PCs. From a statistical point of view, estimating PCs simultaneously
in different groups will result in a joint dimension reducing transformation. This
multi-group PCA, the so-called common principal components analysis (CPCA),
yields the joint eigenstructure across groups.

Compared with traditional PCA, CPCA makes the additional basic assumption that the space
spanned by the eigenvectors is identical across several groups, whereas variances
associated with the components are allowed to vary.
More formally, the hypothesis of common principal components can be stated in
the following way (Flury, 1988):

$$H_{CPC}: \Sigma_i = \Gamma\Lambda_i\Gamma^\top, \quad i = 1,\dots,k,$$

where $\Sigma_i$ is a positive definite $p \times p$ population covariance matrix for every
i, $\Gamma = (\gamma_1,\dots,\gamma_p)$ is an orthogonal $p \times p$ transformation matrix and
$\Lambda_i = \operatorname{diag}(\lambda_{i1},\dots,\lambda_{ip})$ is the matrix of eigenvalues. Moreover, assume that all $\lambda_i$ are
distinct.

Let $\mathcal{S}$ be the (unbiased) sample covariance matrix of an underlying p-variate
normal distribution $N_p(\mu, \Sigma)$ with sample size n. Then the distribution of $n\mathcal{S}$ has
$n - 1$ degrees of freedom and is known as the Wishart distribution (Muirhead, 1982,
p. 86):

$$n\mathcal{S} \sim W_p(\Sigma, n-1).$$

The density is given in (5.16). Hence, for a given Wishart matrix $\mathcal{S}_i$ with sample
size $n_i$, the likelihood function can be written as

$$L(\Sigma_1,\dots,\Sigma_k) = C\prod_{i=1}^{k} \exp\left[\operatorname{tr}\left\{-\frac{1}{2}(n_i - 1)\Sigma_i^{-1}\mathcal{S}_i\right\}\right] |\Sigma_i|^{-\frac{1}{2}(n_i - 1)} \qquad (11.41)$$


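where C is a constant independent of the parameters $\Sigma_i$. Maximising the likelihood
is equivalent to minimising the function

$$g(\Sigma_1,\dots,\Sigma_k) = \sum_{i=1}^{k} (n_i - 1)\left\{\log|\Sigma_i| + \operatorname{tr}(\Sigma_i^{-1}\mathcal{S}_i)\right\}.$$

Assuming that $H_{CPC}$ holds, i.e. in replacing $\Sigma_i$ by $\Gamma\Lambda_i\Gamma^\top$, after some manipulations
one obtains

$$g(\Gamma, \Lambda_1,\dots,\Lambda_k) = \sum_{i=1}^{k} (n_i - 1)\sum_{j=1}^{p}\left(\log\lambda_{ij} + \frac{\gamma_j^\top\mathcal{S}_i\gamma_j}{\lambda_{ij}}\right).$$

As we know from Sect. 2.2, the vectors $\gamma_j$ in $\Gamma$ have to be orthogonal.
Orthogonality of the vectors $\gamma_j$ is achieved using the Lagrange method, i.e. we
impose the p constraints $\gamma_j^\top\gamma_j = 1$ using the Lagrange multipliers $\mu_j$, and the
remaining $p(p-1)/2$ constraints $\gamma_h^\top\gamma_j = 0$ for $h \ne j$ using the multiplier $2\mu_{hj}$
(Flury, 1988). This yields

$$g^*(\Gamma, \Lambda_1,\dots,\Lambda_k) = g(\cdot) - \sum_{j=1}^{p} \mu_j(\gamma_j^\top\gamma_j - 1) - 2\sum_{h=1}^{p}\sum_{j=h+1}^{p} \mu_{hj}\gamma_h^\top\gamma_j.$$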
Taking partial derivatives with respect to all $\lambda_{im}$ and $\gamma_m$, it can be shown that the
solution of the CPC model is given by the generalised system of characteristic
equations

$$\gamma_m^\top\left\{\sum_{i=1}^{k} (n_i - 1)\frac{\lambda_{im} - \lambda_{ij}}{\lambda_{im}\lambda_{ij}}\,\mathcal{S}_i\right\}\gamma_j = 0, \quad m, j = 1,\dots,p, \quad m \ne j. \qquad (11.42)$$

This system can be solved using

$$\lambda_{im} = \gamma_m^\top\mathcal{S}_i\gamma_m, \quad i = 1,\dots,k, \quad m = 1,\dots,p$$

under the constraints

$$\gamma_m^\top\gamma_j = \begin{cases} 0 & m \ne j, \\ 1 & m = j. \end{cases}$$

Flury  (1988)  proves  existence  and  uniqueness  of  the  maximum  of  the  likelihood 
function,  and  Flury  and  Gautschi  (1986)  provide  a  numerical  algorithm. 
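
To see what the characteristic equations (11.42) express, the sketch below constructs an idealised case: two hypothetical 2 x 2 group covariance matrices that share exactly the same eigenvectors but have different eigenvalues. It then recovers the group eigenvalues from the quadratic forms and evaluates the left-hand side of (11.42). All numbers and helper names are illustrative assumptions, and this check is not the numerical algorithm of Flury and Gautschi (1986).

```python
import math

def mat_vec(m, v):
    return [sum(m[r][c] * v[c] for c in range(len(v))) for r in range(len(m))]

def quad(u, m, v):
    # Bilinear form u^T M v.
    return sum(a * b for a, b in zip(u, mat_vec(m, v)))

# Common orthonormal eigenvectors: a rotation by a hypothetical angle.
angle = math.pi / 6
gamma = [(math.cos(angle), math.sin(angle)), (-math.sin(angle), math.cos(angle))]

def covariance(lams):
    # S = sum_m lambda_m * gamma_m gamma_m^T
    return [[sum(l * g[r] * g[c] for l, g in zip(lams, gamma)) for c in range(2)]
            for r in range(2)]

# Two groups sharing gamma but with different eigenvalues (hypothetical values).
S = [covariance((5.0, 1.0)), covariance((3.0, 0.5))]
n = [50, 80]

# lambda_im = gamma_m^T S_i gamma_m
lam = [[quad(g, S_i, g) for g in gamma] for S_i in S]

# Left-hand side of (11.42) for the pair m = 1, j = 2 (0-based indices 0 and 1).
m, j = 0, 1
lhs = sum((n_i - 1) * (lam_i[m] - lam_i[j]) / (lam_i[m] * lam_i[j])
          * quad(gamma[m], S_i, gamma[j])
          for n_i, lam_i, S_i in zip(n, lam, S))

print("group eigenvalues:", lam)
print("left-hand side of (11.42):", lhs)
```

With real data the sample matrices do not share eigenvectors exactly, so the system has to be solved iteratively for the common matrix of eigenvectors.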

Example 11.7 As an example we provide the data sets XFGvolsurf01,
XFGvolsurf02 and XFGvolsurf03 that have been used in Fengler, Härdle, and
Villa (2003) to estimate common principal components for the implied volatility
surfaces of the DAX 1999. The data has been generated by smoothing an implied
volatility surface day by day. Next, the estimated grid points have been grouped into
maturities of $\tau = 1$, $\tau = 2$ and $\tau = 3$ months and transformed into a vector of time
series of the "smile", i.e. each element of the vector belongs to a distinct moneyness
ranging from 0.85 to 1.10.

Figure 11.9 shows the first three eigenvectors in a parallel coordinate plot. The
basic structure of the first three eigenvectors is not altered. We find a shift, a slope
and a twist structure. This structure is common to all maturity groups, i.e. when
exploiting PCA as a dimension reducing tool, the same transformation applies to
each group! However, by comparing the size of eigenvalues among groups we find
that variability is decreasing across groups as we move from the short-term contracts
to long-term contracts.

Before drawing conclusions we should convince ourselves that the CPC model
is truly a good description of the data. This can be done by using a likelihood ratio
test. The likelihood ratio statistic for comparing a restricted (the CPC) model against
the unrestricted model (the model where all covariances are treated separately) is
given by

$$T_{(n_1,n_2,\dots,n_k)} = -2\log\frac{L(\hat\Sigma_1,\dots,\hat\Sigma_k)}{L(\mathcal{S}_1,\dots,\mathcal{S}_k)}.$$
Fig. 11.9 Factor loadings of the first (thick), the second (medium), and the third (thin) PC (plot title: PCP for CPCA, 3 eigenvectors) Q MVAcpcaiv


Inserting the likelihood function, we find that this is equivalent to

$$T_{(n_1,n_2,\dots,n_k)} = \sum_{i=1}^{k} (n_i - 1)\log\frac{\det(\hat\Sigma_i)}{\det(\mathcal{S}_i)},$$

which has a $\chi^2$ distribution as $\min(n_i)$ tends to infinity with

$$k\left\{\frac{1}{2}p(p-1) + p\right\} - \left\{\frac{1}{2}p(p-1) + kp\right\} = \frac{1}{2}(k-1)p(p-1)$$

degrees of freedom. This test is included in the quantlet Q MVAcpcaiv.

The calculations yield $T_{(n_1,n_2,\dots,n_k)} = 31.836$, which corresponds to the p-value
p = 0.37512 for the $\chi^2(30)$ distribution. Hence we cannot reject the CPC model
against the unrestricted model, where PCA is applied to each maturity separately.
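
The degrees of freedom and the p-value quoted above can be checked without a statistics library. The sketch below counts the parameters of both models and evaluates the upper tail of the chi-square distribution with the closed-form sum that is valid for even degrees of freedom. The function names are illustrative, and p = 6 is an assumption inferred from the 30 degrees of freedom with k = 3 groups.

```python
import math

def cpc_degrees_of_freedom(k, p):
    """Parameter count of the unrestricted model minus that of the CPC model."""
    unrestricted = k * (p * (p - 1) // 2 + p)
    restricted = p * (p - 1) // 2 + k * p
    return unrestricted - restricted

def chi2_sf_even(x, df):
    """P(chi2_df > x) for even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!."""
    if df % 2 != 0:
        raise ValueError("this closed form requires an even number of degrees of freedom")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

df = cpc_degrees_of_freedom(k=3, p=6)  # p = 6 is assumed (see the lead-in)
p_value = chi2_sf_even(31.836, df)
print(f"degrees of freedom = {df}, p-value = {p_value:.4f}")
```

For an odd number of degrees of freedom the tail probability would need a numerical routine for the incomplete gamma function instead.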

Using the methods in Sect. 11.3, we can estimate the amount of variability, $\zeta_l$,
explained by the first l principal components: only a few factors, three at the
most, are needed to capture a large amount of the total variability present in the
data. Since the model now captures the variability in both the strike and maturity
dimensions, this is a suitable starting point for a simplified VaR calculation for
delta-gamma neutral option portfolios using Monte Carlo methods, and is hence
a valuable insight in risk management.
11.8 Boston Housing

A set of transformations was defined in Chap. 1 for the Boston Housing data set that
resulted in “regular” marginal distributions. The usefulness of principal component
analysis with respect to such high-dimensional data sets will now be shown. The
variable $X_4$ is dropped because it is a discrete 0-1 variable. It will be used later,
however, in the graphical representations. The scale difference of the remaining 13
variables motivates an NPCA based on the correlation matrix.

The  eigenvalues  and  the  percentage  of  explained  variance  are  given  in 
Table  11.5. 

The  first  principal  component  explains  56  %  of  the  total  variance  and  the  first 
three  components  together  explain  more  than  75%.  These  results  imply  that  it  is 
sufficient  to  look  at  2,  maximum  3,  principal  components. 
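
The percentages quoted above follow directly from the eigenvalues: each eigenvalue is divided by their sum (13 for a correlation matrix of 13 variables, up to rounding). A minimal Python sketch, using the eigenvalues listed in Table 11.5; the function name is a hypothetical helper, not part of the book's quantlets:

```python
# Eigenvalues of the correlation matrix as listed in Table 11.5.
eigenvalues = [7.2852, 1.3517, 1.1266, 0.7802, 0.6359, 0.5290, 0.3397,
               0.2628, 0.1936, 0.1547, 0.1405, 0.1100, 0.0900]


def explained_variance(eigs):
    """Return per-component proportions and their running (cumulated) sums."""
    total = sum(eigs)
    proportions = [e / total for e in eigs]
    cumulated = []
    running = 0.0
    for prop in proportions:
        running += prop
        cumulated.append(running)
    return proportions, cumulated


proportions, cumulated = explained_variance(eigenvalues)
for q in range(3):
    print(f"PC{q + 1}: proportion {proportions[q]:.4f}, cumulated {cumulated[q]:.4f}")
```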

Table  11.6  provides  the  correlations  between  the  first  three  PCs  and  the  original 
variables.  These  can  be  seen  in  Fig.  11.10. 

The correlations with the first PC show a very clear pattern. The variables
$X_2$, $X_6$, $X_8$, $X_{12}$, and $X_{14}$ are strongly positively correlated with the first PC,
whereas the remaining variables are highly negatively correlated. The minimal
correlation in the absolute value is 0.5. The first PC axis could be interpreted as
a quality of life and house indicator. The second axis, given the polarities of $X_{11}$
and $X_{13}$ and of $X_6$ and $X_{14}$, can be interpreted as a social factor explaining only
10 % of the total variance. The third axis is dominated by a polarity between $X_2$ and
$X_{12}$.

The  set  of  individuals  from  the  first  two  PCs  can  be  graphically  interpreted 
if  the  plots  are  colour  coded  with  respect  to  some  particular  variable  of  interest. 


Table 11.5 Eigenvalues and percentage of explained variance for Boston Housing data Q MVAnpcahousi

Eigenvalue   Percentages   Cumulated percentages
7.2852       0.5604        0.5604
1.3517       0.1040        0.6644
1.1266       0.0867        0.7510
0.7802       0.0600        0.8111
0.6359       0.0489        0.8600
0.5290       0.0407        0.9007
0.3397       0.0261        0.9268
0.2628       0.0202        0.9470
0.1936       0.0149        0.9619
0.1547       0.0119        0.9738
0.1405       0.0108        0.9846
0.1100       0.0085        0.9931
0.0900       0.0069        1.0000
Table 11.6 Correlations of the first three PCs with the original variables Q MVAnpcahous

        PC1       PC2       PC3
X1    -0.9076    0.2247    0.1457
X2     0.6399   -0.0292    0.5058
X3    -0.8580    0.0409   -0.1845
X5    -0.8737    0.2391   -0.1780
X6     0.5104    0.7037    0.0869
X7    -0.7999    0.1556   -0.2949
X8     0.8259   -0.2904    0.2982
X9    -0.7531    0.2857    0.3804
X10   -0.8114    0.1645    0.3672
X11   -0.5674   -0.2667    0.1498
X12    0.4906   -0.1041   -0.5170
X13   -0.7996   -0.4253   -0.0251
X14    0.7366    0.5160   -0.1747

Fig. 11.10 NPCA for the Boston housing data, correlations of first three PCs with the original
variables Q MVAnpcahousi

Figure 11.11 colour codes $X_{14}$ > median as red points. Clearly the first and second
PCs are related to house value. The situation is less clear in Fig. 11.12 where the
colour code corresponds to $X_4$, the Charles River indicator, i.e. houses near the river
are coloured red.
Fig. 11.11 NPC analysis for
(panel title: First vs. Second PC; horizontal axis: PC1)

Fig. 11.12 NPC analysis for the Boston housing data, scatterplot of the first two
PCs. Houses close to the Charles River are indicated with red squares Q MVAnpcahous
(panel title: First vs. Second PC; horizontal axis: PC1)


11.9  More  Examples 

Example 11.8 Let us now apply the PCA to the standardised bank data set
(Sect. 22.2). Figure 11.13 shows some PC plots of the bank data set. The genuine
and counterfeit bank notes are marked by “o” and “+” respectively.

The vector of eigenvalues of $\mathcal{R}$ is

\[
\ell = (2.946, 1.278, 0.869, 0.450, 0.269, 0.189)^{\top}.
\]
Fig. 11.13 Principal components of the standardised bank data Q MVAnpcabank
(panels: First vs. Second PC; Second vs. Third PC; First vs. Third PC; Eigenvalues of S)


Table 11.7 Eigenvalues and proportions of explained variance

$\ell_j$    Proportion of variances   Cumulated proportion (%)
2.946    0.491                     49.1
1.278    0.213                     70.4
0.869    0.145                     84.9
0.450    0.075                     92.4
0.269    0.045                     96.9
0.189    0.032                     100.0

The eigenvectors $g_j$ are given by the columns of the matrix

\[
\mathcal{G} = \begin{pmatrix}
-0.007 & -0.815 & 0.018 & 0.575 & 0.059 & 0.031 \\
0.468 & -0.342 & -0.103 & -0.395 & -0.639 & -0.298 \\
0.487 & -0.252 & -0.123 & -0.430 & 0.614 & 0.349 \\
0.407 & 0.266 & -0.584 & 0.404 & 0.215 & -0.462 \\
0.368 & 0.091 & 0.788 & 0.110 & 0.220 & -0.419 \\
-0.493 & -0.274 & -0.114 & -0.392 & 0.340 & -0.632
\end{pmatrix}.
\]

Each  original  variable  has  the  same  weight  in  the  analysis  and  the  results  are 
independent  of  the  scale  of  each  variable. 

The proportions of explained variance are given in Table 11.7. It can be
concluded that the representation in two dimensions should be sufficient. The
correlations leading to Fig. 11.14 are given in Table 11.8. The picture is different
from the one obtained in Sect. 11.3 (see Table 11.2). Here, the first factor is mainly
a left-right vs. diagonal factor and the second one is a length factor (with negative
weight). Take another look at Fig. 11.13, where the individual bank notes are
displayed. In the upper left graph it can be seen that the genuine bank notes are for
the most part in the south-eastern portion of the graph featuring a larger diagonal,
smaller height ($Z_1 < 0$) and also a larger length ($Z_2 < 0$). Note also that Fig. 11.14
gives an idea of the correlation structure of the original data matrix.

Fig. 11.14 The correlations of the original variable with the PCs Q MVAnpcabanki
(horizontal axis: First PC)

Table 11.8 Correlations with PCs

                    r_{X_i Z_1}   r_{X_i Z_2}   r^2_{X_i Z_1} + r^2_{X_i Z_2}
X1: length          -0.012        -0.922        0.85
X2: left height      0.803        -0.387        0.79
X3: right height     0.835        -0.285        0.78
X4: lower            0.698         0.301        0.58
X5: upper            0.631         0.104        0.41
X6: diagonal        -0.847        -0.310        0.81
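
For standardised data every variable has unit variance, so the correlation between variable $X_i$ and the $j$-th PC reduces to $r_{X_i Z_j} = g_{ij} \sqrt{\ell_j}$. The following Python sketch applies this to the first two eigenvectors and eigenvalues quoted above and should reproduce the entries of Table 11.8 up to rounding of the inputs. The variable and function names are illustrative only.

```python
import math

# First two eigenvalues and eigenvectors of the correlation matrix (from the text).
ell = [2.946, 1.278]
g1 = [-0.007, 0.468, 0.487, 0.407, 0.368, -0.493]
g2 = [-0.815, -0.342, -0.252, 0.266, 0.091, -0.274]
names = ["length", "left height", "right height", "lower", "upper", "diagonal"]


def pc_correlations(g, eigenvalue):
    """Correlations of unit-variance variables with one PC: g_ij * sqrt(l_j)."""
    return [g_i * math.sqrt(eigenvalue) for g_i in g]


r1 = pc_correlations(g1, ell[0])
r2 = pc_correlations(g2, ell[1])
# Share of each variable's variance captured by the first two PCs.
r_squared_sum = [a * a + b * b for a, b in zip(r1, r2)]

for name, a, b, s in zip(names, r1, r2, r_squared_sum):
    print(f"{name:13s} {a:7.3f} {b:7.3f} {s:5.2f}")
```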

Example 11.9 Consider the data of 79 US companies given in Table 22.5. The data
is first standardised by subtracting the mean and dividing by the standard deviation.
Note that the data set contains six variables: assets ($X_1$), sales ($X_2$), market value
($X_3$), profits ($X_4$), cash flow ($X_5$), number of employees ($X_6$).

Calculating the corresponding vector of eigenvalues gives

\[
\ell = (5.039, 0.517, 0.359, 0.050, 0.029, 0.007)^{\top}
\]

and the matrix of eigenvectors is

\[
\mathcal{G} = \begin{pmatrix}
0.340 & -0.849 & -0.339 & 0.205 & 0.077 & -0.006 \\
0.423 & -0.170 & 0.379 & -0.783 & -0.006 & -0.186 \\
0.434 & 0.190 & -0.192 & 0.071 & -0.844 & 0.149 \\
0.420 & 0.364 & -0.324 & 0.156 & 0.261 & -0.703 \\
0.428 & 0.285 & -0.267 & -0.121 & 0.452 & 0.667 \\
0.397 & 0.010 & 0.726 & 0.548 & 0.098 & 0.065
\end{pmatrix}.
\]

Fig. 11.15 Principal components of the US company data Q MVAnpcausco
(panel title: First vs. Second PC)

Using  this  information  the  graphical  representations  of  the  first  two  principal 
components  are  given  in  Fig.  11.15.  The  different  sectors  are  marked  by  the 
following  symbols: 

H  ...  Hi  Tech  and  Communication 

E  ...  Energy 

F  ...  Finance 

M  ...  Manufacturing 

R  ...  Retail 

★  ...  all  other  sectors. 

The two outliers in the right-hand side of the graph are IBM and General Electric
(GE), which differ from the other companies with their high market values. As can
be seen in the first column of $\mathcal{G}$, market value has the largest weight in the first
PC, adding to the isolation of these two companies. If IBM and GE were to be
excluded from the data set, a completely different picture would emerge, as shown
in Fig. 11.16. In this case the vector of eigenvalues becomes

\[
\ell = (3.191, 1.535, 0.791, 0.292, 0.149, 0.041)^{\top},
\]

and the corresponding matrix of eigenvectors is

\[
\mathcal{G} = \begin{pmatrix}
0.263 & -0.408 & -0.800 & -0.067 & 0.333 & 0.099 \\
0.438 & -0.407 & 0.162 & -0.509 & -0.441 & -0.403 \\
0.500 & -0.003 & -0.035 & 0.801 & -0.264 & -0.190 \\
0.331 & 0.623 & -0.080 & -0.192 & 0.426 & -0.526 \\
0.443 & 0.450 & -0.123 & -0.238 & -0.335 & 0.646 \\
0.427 & -0.277 & 0.558 & 0.021 & 0.575 & 0.313
\end{pmatrix}.
\]

The percentage of variation explained by each component is given in Table 11.9.
The first two components explain almost 79 % of the variance. The interpretation of
the factors (the axes of Fig. 11.16) is given in the table of correlations (Table 11.10).
The first two columns of this table are plotted in Fig. 11.17.

Fig. 11.16 Principal components of the US company data (without IBM and General Electric) Q
MVAnpcausco2 (panels: First vs. Second PC; Eigenvalues of S)
Table 11.9 Eigenvalues and proportions of explained variance

$\ell_j$    Proportion of variance   Cumulated proportion
3.191    0.532                    0.532
1.535    0.256                    0.788
0.791    0.132                    0.920
0.292    0.049                    0.968
0.149    0.025                    0.993
0.041    0.007                    1.000

Table 11.10 Correlations with PCs

                    r_{X_i Z_1}   r_{X_i Z_2}   r^2_{X_i Z_1} + r^2_{X_i Z_2}
X1: assets          0.47          -0.510        0.48
X2: sales           0.78          -0.500        0.87
X3: market value    0.89          -0.003        0.80
X4: profits         0.59           0.770        0.95
X5: cash flow       0.79           0.560        0.94
X6: employees       0.76          -0.340        0.70

Fig. 11.17 The correlation of the original variables with the PCs Q MVAnpcausco2i
(horizontal axis: First PC)


From Fig. 11.17 (and Table 11.10) it appears that the first factor is a “size effect”:
it is positively correlated with all the variables describing the size of the activity of
the  companies.  It  is  also  a  measure  of  the  economic  strength  of  the  firms.  The  second 
factor  describes  the  “shape”  of  the  companies  (“profit-cash  flow”  vs.  “assets-sales” 
factor),  which  is  more  difficult  to  interpret  from  an  economic  point  of  view. 
Example 11.10 Volle (1985) analyses data on 28 individuals (Table 22.14). For
each individual, the time spent (in hours) on 10 different activities has been recorded
over 100 days, as well as informative statistics such as the individual’s sex, country
of residence, professional activity and matrimonial status. The results of an NPCA
are given below.

The eigenvalues of the correlation matrix are given in Table 11.11. Note that the
last eigenvalue is exactly zero since the correlation matrix is singular (the sum of all
the variables is always equal to 2,400 = 24 × 100). The results of the first four PCs are
given in Tables 11.12 and 11.13.

From  these  tables  (and  Figs.  11.18  and  11.19),  it  appears  that  the  professional 
and  household  activities  are  strongly  contrasted  in  the  first  factor.  Indeed  on  the 
horizontal  axis  of  Fig.  11.18  it  can  be  seen  that  all  the  active  men  are  on  the  right 
and  all  the  inactive  women  are  on  the  left.  Active  women  and/or  single  women  are 
in  between.  The  second  factor  contrasts  meal/sleeping  vs.  toilet/shopping  (note  the 
high  correlation  between  meal  and  sleeping).  Along  the  vertical  axis  of  Fig.  11.18 
we  see  near  the  bottom  of  the  graph  the  people  from  Western-European  countries, 
who  spend  more  time  on  meals  and  sleeping  than  people  from  the  US  (who  can  be 
found  close  to  the  top  of  the  graph).  The  other  categories  are  in  between. 


Table 11.11 Eigenvalues of correlation matrix for the time budget data

$\ell_j$   Proportion of variance   Cumulated proportion
4.59    0.459                    0.460
2.12    0.212                    0.670
1.32    0.132                    0.800
1.20    0.120                    0.920
0.47    0.047                    0.970
0.20    0.020                    0.990
0.05    0.005                    0.990
0.04    0.004                    0.999
0.02    0.002                    1.000
0.00    0.000                    1.000

Table 11.12 Correlation of variables with PCs

             r_{X_i W_1}   r_{X_i W_2}   r_{X_i W_3}   r_{X_i W_4}
X1:  prof     0.9772       -0.1210       -0.0846        0.0669
X2:  tran     0.9798        0.0581       -0.0084        0.4555
X3:  hous    -0.8999        0.0227        0.3624        0.2142
X4:  kids    -0.8721        0.1786        0.0837        0.2944
X5:  shop    -0.5636        0.7606       -0.0046       -0.1210
X6:  pers    -0.0795        0.8181       -0.3022       -0.0636
X7:  eati    -0.5883       -0.6694       -0.4263        0.0141
X8:  slee    -0.6442       -0.5693       -0.1908       -0.3125
X9:  tele    -0.0994        0.1931       -0.9300        0.1512
X10: leis    -0.0922        0.1103        0.0302       -0.9574
Table 11.13 PCs for time budget data

         Z1        Z2        Z3        Z4
maus    0.0633    0.0245   -0.0668    0.0205
waus    0.0061    0.0791   -0.0236    0.0156
wnus   -0.1448    0.0813   -0.0379   -0.0186
mmus    0.0635    0.0105   -0.0673    0.0262
wmus   -0.0934    0.0816   -0.0285    0.0038
msus    0.0537    0.0676   -0.0487   -0.0279
wsus    0.0166    0.1016   -0.0463   -0.0053
mawe    0.0420   -0.0846   -0.0399   -0.0016
wawe   -0.0111   -0.0534   -0.0097    0.0337
wnwe   -0.1544   -0.0583   -0.0318   -0.0051
mmwe    0.0402   -0.0880   -0.0459    0.0054
wmwe   -0.1118   -0.0710   -0.0210    0.0262
mswe    0.0489   -0.0919   -0.0188   -0.0365
wswe   -0.0393   -0.0591   -0.0194   -0.0534
mayo    0.0772   -0.0086    0.0253   -0.0085
wayo    0.0359    0.0064    0.0577    0.0762
wnyo   -0.1263   -0.0135    0.0584   -0.0189
mmyo    0.0793   -0.0076    0.0173   -0.0039
wmyo   -0.0550   -0.0077    0.0579    0.0416
msyo    0.0763    0.0207    0.0575   -0.0778
wsyo    0.0120    0.0149    0.0532   -0.0366
maes    0.0767   -0.0025    0.0047    0.0115
waes    0.0353    0.0209    0.0488    0.0729
wnes   -0.1399    0.0016    0.0240   -0.0348
mmes    0.0742   -0.0061   -0.0152    0.0283
wmes   -0.0175    0.0073    0.0429    0.0719
mses    0.0903    0.0052    0.0379   -0.0701
fses    0.0020    0.0287    0.0358   -0.0346


In  Fig.  11.19  the  variables  television  and  other  leisure  activities  hardly  play 
any  role  (look  at  Table  11.12).  The  variable  television  appears  in  Z3  (negatively 
correlated).  Table  11.13  shows  that  this  factor  contrasts  people  from  Eastern 
countries and Yugoslavia with men living in the US. The variable other leisure
activities  is  the  factor  Z4.  It  merely  distinguishes  between  men  and  women  in 
Eastern  countries  and  in  Yugoslavia.  These  last  two  factors  are  orthogonal  to 
the  preceding  axes  and  of  course  their  contribution  to  the  total  variation  is  less 
important. 

Fig. 11.18 Representation of the individuals Q MVAnpcatime

Fig. 11.19 Representation of the variables Q MVAnpcatime

11.10 Exercises


Exercise 11.1 Prove Theorem 11.1. (Hint: use (4.23).)


Exercise  11.2  Interpret  the  results  of  the  PCA  of  the  US  companies.  Use  the 
analysis  of  the  bank  notes  in  Sect.  11.3  as  a  guide.  Compare  your  results  with  those 
in  Example  11.9. 

Exercise 11.3 Test the hypothesis that the proportion of variance explained by the
first two PCs for the US companies is $\psi = 0.75$.

Exercise  11.4  Apply  the  PCA  to  the  car  data  (Table  22.7).  Interpret  the  first  two 
PCs.  Would  it  be  necessary  to  look  at  the  third  PC? 


Exercise 11.5 Take the athletic records for 55 countries (Sect. 22.18) and apply the
NPCA. Interpret your results.


Exercise 11.6 Apply a PCA to $\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$, where $\rho > 0$. Now change the scale
of $X_1$, i.e. consider the covariance of $cX_1$ and $X_2$. How do the PC directions change
with the screeplot?


Exercise 11.7 Suppose that we have standardised some data using the Mahalanobis
transformation. Would it be reasonable to apply a PCA?

Exercise 11.8 Apply an NPCA to the US CRIME data set (Table 22.10). Interpret the
results. Would it be necessary to look at the third PC? Can you see any difference
between the four regions? Redo the analysis excluding the variable “area of the
state”.


Exercise  11.9  Repeat  Exercise  11.8  using  the  US  HEALTH  data  set  (Table  22.16). 

Exercise 11.10 Do an NPCA on the GEOPOL data set (see Table 22.15) which
compares 41 countries w.r.t. different aspects of their development. Why would a
PCA be reasonable here, or why not?

Exercise 11.11 Let $U$ be a uniform r.v. on [0,1]. Let $a \in \mathbb{R}^3$ be a vector of
constants. Suppose that $X = U a^{\top} = (X_1, X_2, X_3)$. What do you expect the NPCs
of $X$ to be?

Exercise 11.12 Let $U_1$ and $U_2$ be two independent uniform random variables on
[0,1]. Suppose that $X = (X_1, X_2, X_3, X_4)^{\top}$ where $X_1 = U_1$, $X_2 = U_2$, $X_3 =
U_1 + U_2$ and $X_4 = U_1 - U_2$. Compute the correlation matrix $P$ of $X$. How many
PCs are of interest? Show that $\gamma_1 = \left( \frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}, 1, 0 \right)^{\top}$ and $\gamma_2 = \left( \frac{1}{\sqrt{2}}, -\frac{1}{\sqrt{2}}, 0, 1 \right)^{\top}$
are eigenvectors of $P$ corresponding to the non trivial $\lambda$'s. Interpret the first two
NPCs obtained.

Exercise 11.13 Simulate a sample of size $n = 50$ for the r.v. $X$ in Exercise 11.12
and analyse the results of an NPCA.

Exercise 11.14 Bouroche and Saporta (1980) reported the data on the state
expenses of France from the period 1872 to 1971 (24 selected years) by noting the
percentage of 11 categories of expenses. Do an NPCA of this data set. Do the three
main periods (before WWI, between WWI and WWII, and after WWII) indicate a
change in behaviour w.r.t. state expenses?


Chapter  12 

Factor  Analysis 


A  frequently  applied  paradigm  in  analysing  data  from  multivariate  observations  is  to 
model  the  relevant  information  (represented  in  a  multivariate  variable  X)  as  coming 
from  a  limited  number  of  latent  factors.  In  a  survey  on  household  consumption,  for 
example,  the  consumption  levels,  X,  of  p  different  goods  during  1  month  could  be 
observed.  The  variations  and  covariations  of  the  p  components  of  X  throughout  the 
survey  might  in  fact  be  explained  by  two  or  three  main  social  behaviour  factors 
of  the  household.  For  instance,  a  basic  desire  of  comfort  or  the  willingness  to 
achieve  a  certain  social  level  or  other  social  latent  concepts  might  explain  most  of 
the  consumption  behaviour.  These  unobserved  factors  are  much  more  interesting  to 
the  social  scientist  than  the  observed  quantitative  measures  (X)  themselves,  because 
they  give  a  better  understanding  of  the  behaviour  of  households.  As  shown  in  the 
examples  below,  the  same  kind  of  factor  analysis  is  of  interest  in  many  fields  such 
as psychology, marketing, economics, and political sciences.

How  can  we  provide  a  statistical  model  addressing  these  issues  and  how 
can  we  interpret  the  obtained  model?  This  is  the  aim  of  factor  analysis.  As  in 
Chaps.  10  and  11,  the  driving  statistical  theme  of  this  chapter  is  to  reduce  the 
dimension  of  the  observed  data.  The  perspective  used,  however,  is  different:  we 
assume  that  there  is  a  model  (it  will  be  called  the  “Factor  Model”)  stating  that  most 
of  the  covariances  between  the  p  elements  of  X  can  be  explained  by  a  limited 
number  of  latent  factors.  Section  12.1  defines  the  basic  concepts  and  notations 
of  the  orthogonal  factor  model,  stressing  the  non-uniqueness  of  the  solutions.  We 
show  how  to  take  advantage  of  this  non-uniqueness  to  derive  techniques  which 
lead  to  easier  interpretations.  This  will  involve  (geometric)  rotations  of  the  factors. 
Section  12.2  presents  an  empirical  approach  to  factor  analysis.  Various  estimation 
procedures  are  proposed  and  an  optimal  rotation  procedure  is  defined.  Many 
examples  are  used  to  illustrate  the  method. 


12.1  The  Orthogonal  Factor  Model 

The aim of factor analysis is to explain the outcome of $p$ variables in the data
matrix $\mathcal{X}$ using fewer variables, the so-called factors. Ideally all the information in
$\mathcal{X}$ can be reproduced by a smaller number of factors. These factors are interpreted
as latent (unobserved) common characteristics of the observed $x \in \mathbb{R}^p$. The case
just described occurs when every observed $x = (x_1, \dots, x_p)^{\top}$ can be written as

\[
x_j = \sum_{l=1}^{k} q_{jl} f_l + \mu_j, \quad j = 1, \dots, p. \tag{12.1}
\]

Here $f_l$, for $l = 1, \dots, k$, denotes the factors. The number of factors, $k$, should
always be much smaller than $p$. For instance, in psychology $x$ may represent $p$
results of a test measuring intelligence scores. One common latent factor explaining
$x \in \mathbb{R}^p$ could be the overall level of “intelligence”. In marketing studies, $x$ may
consist of $p$ answers to a survey on the levels of satisfaction of the customers. These
$p$ measures could be explained by common latent factors like the attraction level of
the product or the image of the brand, and so on. Indeed it is possible to create a
representation of the observations that is similar to the one in (12.1) by means of
principal components, but only if the last $p - k$ eigenvalues corresponding to the
covariance matrix are equal to zero. Consider a $p$-dimensional random vector $X$
with mean $\mu$ and covariance matrix $\operatorname{Var}(X) = \Sigma$. A model similar to (12.1) can be
written for $X$ in matrix notation, namely

\[
X = \mathcal{Q} F + \mu, \tag{12.2}
\]

where $F$ is the $k$-dimensional vector of the $k$ factors. When using the factor
model (12.2) it is often assumed that the factors $F$ are centred, uncorrelated and
standardised: $\operatorname{E}(F) = 0$ and $\operatorname{Var}(F) = \mathcal{I}_k$. We will now show that if the last
$p - k$ eigenvalues of $\Sigma$ are equal to zero, we can easily express $X$ by the factor
model (12.2).

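The spectral decomposition of $\Sigma$ is given by $\Gamma \Lambda \Gamma^{\top}$. Suppose that only the first $k$
eigenvalues are positive, i.e. $\lambda_{k+1} = \dots = \lambda_p = 0$. Then the (singular) covariance
matrix can be written as

\[
\Sigma = \sum_{l=1}^{k} \lambda_l \gamma_l \gamma_l^{\top} = (\Gamma_1 \; \Gamma_2) \begin{pmatrix} \Lambda_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} \Gamma_1^{\top} \\ \Gamma_2^{\top} \end{pmatrix}.
\]

In order to show the connection to the factor model (12.2), recall that the PCs are
given by $Y = \Gamma^{\top}(X - \mu)$. Rearranging we have $X - \mu = \Gamma Y = \Gamma_1 Y_1 + \Gamma_2 Y_2$,
where the components of $Y$ are partitioned according to the partition of $\Gamma$ above,
namely

\[
Y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = \begin{pmatrix} \Gamma_1^{\top} \\ \Gamma_2^{\top} \end{pmatrix} (X - \mu), \quad \text{where} \quad \begin{pmatrix} \Gamma_1^{\top} \\ \Gamma_2^{\top} \end{pmatrix} (X - \mu) \sim \left( 0, \begin{pmatrix} \Lambda_1 & 0 \\ 0 & 0 \end{pmatrix} \right).
\]

In other words, $Y_2$ has a singular distribution with mean and covariance matrix equal
to zero. Therefore, $X - \mu = \Gamma_1 Y_1 + \Gamma_2 Y_2$ implies that $X - \mu$ is equivalent to $\Gamma_1 Y_1$,
which can be written as

\[
X = \Gamma_1 \Lambda_1^{1/2} \Lambda_1^{-1/2} Y_1 + \mu.
\]

Defining $\mathcal{Q} = \Gamma_1 \Lambda_1^{1/2}$ and $F = \Lambda_1^{-1/2} Y_1$, we obtain the factor model (12.2).

Note that the covariance matrix of model (12.2) can be written as

\[
\Sigma = \operatorname{E}(X - \mu)(X - \mu)^{\top} = \mathcal{Q} \operatorname{E}(F F^{\top}) \mathcal{Q}^{\top} = \mathcal{Q} \mathcal{Q}^{\top} = \sum_{j=1}^{k} \lambda_j \gamma_j \gamma_j^{\top}. \tag{12.3}
\]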

We  have  just  shown  how  the  variable  X  can  be  completely  determined  by  a  weighted 
sum  of  k  (where  k  <  p)  uncorrelated  factors.  The  situation  used  in  the  derivation, 
however,  is  too  idealistic.  In  practice  the  covariance  matrix  is  rarely  singular. 
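
The construction of the loadings from a singular covariance matrix can be checked numerically. The sketch below uses a hypothetical three-dimensional example with two positive eigenvalues (the eigenvalues and eigenvectors are made up for illustration) and verifies that the loadings built from the eigenvectors and the square roots of the eigenvalues reproduce the covariance matrix exactly.

```python
import math

# Hypothetical eigen-structure: two positive eigenvalues, the third is zero.
lam = [4.0, 1.0]
gamma = [
    [1 / math.sqrt(2), 1 / math.sqrt(2), 0.0],  # gamma_1
    [0.0, 0.0, 1.0],                            # gamma_2
]
p, k = 3, 2

# Singular covariance matrix: Sigma = sum_l lambda_l * gamma_l gamma_l^T.
sigma = [[sum(lam[l] * gamma[l][i] * gamma[l][j] for l in range(k))
          for j in range(p)] for i in range(p)]

# Loadings Q = Gamma_1 Lambda_1^{1/2}: column l is sqrt(lambda_l) * gamma_l.
Q = [[math.sqrt(lam[l]) * gamma[l][i] for l in range(k)] for i in range(p)]

# Q Q^T should coincide with Sigma, as in (12.3).
QQt = [[sum(Q[i][l] * Q[j][l] for l in range(k)) for j in range(p)]
       for i in range(p)]
max_gap = max(abs(sigma[i][j] - QQt[i][j]) for i in range(p) for j in range(p))
print("largest entrywise difference:", max_gap)
```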

It is common practice in factor analysis to split the influences of the factors into
common and specific ones. There are, for example, highly informative factors that
are common to all of the components of $X$ and factors that are specific to certain
components. The factor analysis model used in practice is a generalisation of (12.2):

\[
X = \mathcal{Q} F + U + \mu, \tag{12.4}
\]

where $\mathcal{Q}$ is a $(p \times k)$ matrix of the (non-random) loadings of the common factors
$F$ $(k \times 1)$ and $U$ is a $(p \times 1)$ matrix of the (random) specific factors. It is assumed that
the factor variables $F$ are uncorrelated random vectors and that the specific factors
are uncorrelated and have zero covariance with the common factors. More precisely,
it is assumed that:

\[
\begin{aligned}
\operatorname{E} F &= 0, \\
\operatorname{Var}(F) &= \mathcal{I}_k, \\
\operatorname{E} U &= 0, \\
\operatorname{Cov}(U_i, U_j) &= 0, \quad i \neq j, \\
\operatorname{Cov}(F, U) &= 0.
\end{aligned} \tag{12.5}
\]

Define

\[
\operatorname{Var}(U) = \Psi = \operatorname{diag}(\psi_{11}, \dots, \psi_{pp}).
\]



The  generalised  factor  model  (12.4)  together  with  the  assumptions  given  in  (12.5) 
constitute  the  orthogonal  factor  model. 


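Orthogonal Factor Model

\[
\underset{(p \times 1)}{X} = \underset{(p \times k)}{\mathcal{Q}} \; \underset{(k \times 1)}{F} + \underset{(p \times 1)}{U} + \underset{(p \times 1)}{\mu}
\]

$\mu_j$ = mean of variable $j$
$U_j$ = $j$-th specific factor
$F_l$ = $l$-th common factor
$q_{jl}$ = loading of the $j$-th variable on the $l$-th factor

The random vectors $F$ and $U$ are unobservable and uncorrelated.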

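Note that (12.4) implies for the components of $X = (X_1, \dots, X_p)^{\top}$ that

\[
X_j = \sum_{l=1}^{k} q_{jl} F_l + U_j + \mu_j, \quad j = 1, \dots, p. \tag{12.6}
\]

Using (12.5) we obtain $\sigma_{X_j X_j} = \operatorname{Var}(X_j) = \sum_{l=1}^{k} q_{jl}^2 + \psi_{jj}$. The quantity
$h_j^2 = \sum_{l=1}^{k} q_{jl}^2$ is called the communality and $\psi_{jj}$ the specific variance. Thus the
covariance of $X$ can be rewritten as

\[
\begin{aligned}
\Sigma &= \operatorname{E}(X - \mu)(X - \mu)^{\top} = \operatorname{E}(\mathcal{Q} F + U)(\mathcal{Q} F + U)^{\top} \\
&= \mathcal{Q} \operatorname{E}(F F^{\top}) \mathcal{Q}^{\top} + \operatorname{E}(U U^{\top}) = \mathcal{Q} \operatorname{Var}(F) \mathcal{Q}^{\top} + \operatorname{Var}(U) \\
&= \mathcal{Q} \mathcal{Q}^{\top} + \Psi.
\end{aligned} \tag{12.7}
\]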

In  a  sense,  the  factor  model  explains  the  variations  of  X  for  the  most  part  by  a  small 
number  of  latent  factors  F  common  to  its  p  components  and  entirely  explains  all 
the  correlation  structure  between  its  components,  plus  some  “noise”  U  which  allows 
specific  variations  of  each  component  to  enter.  The  specific  factors  adjust  to  capture 
the  individual  variance  of  each  component.  Factor  analysis  relies  on  the  assumptions 
presented  above.  If  the  assumptions  are  not  met,  the  analysis  could  be  spurious. 
Although  principal  components  analysis  and  factor  analysis  might  be  related  (this 
was  hinted  at  in  the  derivation  of  the  factor  model),  they  are  quite  different  in  nature. 
PCs  are  linear  transformations  of  X  arranged  in  decreasing  order  of  variance  and 
used  to  reduce  the  dimension  of  the  data  set,  whereas  in  factor  analysis,  we  try  to 
model  the  variations  of  X  using  a  linear  transformation  of  a  fixed,  limited  number 
of latent factors. The objective of factor analysis is to find the loadings $\mathcal{Q}$ and
the specific variance $\Psi$. Estimates of $\mathcal{Q}$ and $\Psi$ are deduced from the covariance
structure (12.7).
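
To make the decomposition (12.7) concrete, the following Python sketch builds the covariance matrix implied by a set of loadings and specific variances and reports the communalities. The numbers are hypothetical and chosen so that every variable has unit variance; the helper name is illustrative.

```python
# Hypothetical loadings (p = 3 variables, k = 2 factors) and specific variances.
Q = [[0.9, 0.1],
     [0.8, -0.3],
     [0.2, 0.7]]
psi = [0.18, 0.27, 0.47]


def model_covariance(loadings, specific):
    """Sigma = Q Q^T + Psi with Psi = diag(specific)."""
    p = len(loadings)
    return [[sum(a * b for a, b in zip(loadings[i], loadings[j]))
             + (specific[i] if i == j else 0.0)
             for j in range(p)] for i in range(p)]


# Communality h_j^2: the part of Var(X_j) explained by the common factors.
communalities = [sum(q * q for q in row) for row in Q]
sigma = model_covariance(Q, psi)

for j, (h2, s) in enumerate(zip(communalities, psi), start=1):
    print(f"X{j}: communality {h2:.2f} + specific variance {s:.2f} = {sigma[j - 1][j - 1]:.2f}")
```

The off-diagonal entries of `sigma` depend only on the loadings, which is the sense in which the common factors account for all of the covariance between components.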

Interpretation of the Factors

Assume that a factor model with $k$ factors was found to be reasonable, i.e. most
of the (co)variations of the $p$ measures in $X$ were explained by the $k$ fixed latent
factors. The next natural step is to try to understand what these factors represent. To
interpret $F_l$, it makes sense to compute its correlations with the original variables
$X_j$ first. This is done for $l = 1, \dots, k$ and for $j = 1, \dots, p$ to obtain the matrix
$P_{XF}$. The sequence of calculations used here is in fact the same that was used to
interpret the PCs in the principal components analysis.

The following covariance between $X$ and $F$ is obtained via (12.5),

\[
\Sigma_{XF} = \operatorname{E}\{(\mathcal{Q} F + U) F^{\top}\} = \mathcal{Q}.
\]

The correlation is

\[
P_{XF} = D^{-1/2} \mathcal{Q}, \tag{12.8}
\]

where $D = \operatorname{diag}(\sigma_{X_1 X_1}, \dots, \sigma_{X_p X_p})$. Using (12.8) it is possible to construct a
figure analogous to Fig. 11.6 and thus to consider which of the original variables
$X_1, \dots, X_p$ play a role in the unobserved common factors $F_1, \dots, F_k$.
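
Formula (12.8) amounts to dividing each row of the loading matrix by the standard deviation of the corresponding variable. A short Python sketch with hypothetical loadings and specific variances (chosen so that the variances are 16 and 9):

```python
import math

# Hypothetical loadings and specific variances for p = 2 variables, k = 2 factors.
Q = [[3.0, 0.0],
     [1.2, 1.6]]
psi = [7.0, 5.0]

# Var(X_j) = communality + specific variance, i.e. the diagonal of Q Q^T + Psi.
variances = [sum(q * q for q in row) + s for row, s in zip(Q, psi)]

# P_XF = D^{-1/2} Q: row j of Q divided by the standard deviation of X_j.
P_xf = [[q / math.sqrt(v) for q in row] for row, v in zip(Q, variances)]

for j, row in enumerate(P_xf, start=1):
    print(f"X{j}:", [round(r, 4) for r in row])
```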

Returning  to  the  psychology  example  where  X  are  the  observed  scores  to  p 
different  intelligence  tests  (the  WAIS  data  set  in  Table  22.12  provides  an  example), 
we  would  expect  a  model  with  one  factor  to  produce  a  factor  that  is  positively 
correlated  with  all  of  the  components  in  X .  For  this  example  the  factor  represents 
the  overall  level  of  intelligence  of  an  individual.  A  model  with  two  factors  could 
produce  a  refinement  in  explaining  the  variations  of  the  p  scores.  For  example, 
the  first  factor  could  be  the  same  as  before  (overall  level  of  intelligence),  whereas 
the  second  factor  could  be  positively  correlated  with  some  of  the  tests,  Xj ,  that 
are  related  to  the  individual’s  ability  to  think  abstractly  and  negatively  correlated 
with  other  tests,  Xj,  that  are  related  to  the  individual’s  practical  ability.  The  second 
factor  would  then  concern  a  particular  dimension  of  the  intelligence  stressing  the 
distinctions  between  the  “theoretical”  and  “practical”  abilities  of  the  individual. 
If  the  model  is  true,  most  of  the  information  coming  from  the  p  scores  can  be 
summarised  by  these  two  latent  factors.  Other  practical  examples  are  given  below. 


Invariance  of  Scale 

What happens if we change the scale of $X$ to $Y = \mathcal{C} X$ with $\mathcal{C} = \operatorname{diag}(c_1, \dots, c_p)$?
If the $k$-factor model (12.6) is true for $X$ with $\mathcal{Q} = \mathcal{Q}_X$, $\Psi = \Psi_X$, then, since

\[
\operatorname{Var}(Y) = \mathcal{C} \Sigma \mathcal{C}^{\top} = \mathcal{C} \mathcal{Q}_X \mathcal{Q}_X^{\top} \mathcal{C}^{\top} + \mathcal{C} \Psi_X \mathcal{C}^{\top},
\]

the same $k$-factor model is also true for $Y$ with $\mathcal{Q}_Y = \mathcal{C} \mathcal{Q}_X$ and $\Psi_Y = \mathcal{C} \Psi_X \mathcal{C}^{\top}$.
In many applications, the search for the loadings $\mathcal{Q}$ and for the specific variance
$\Psi$ will be done by the decomposition of the correlation matrix of $X$ rather than the
covariance matrix $\Sigma$. This corresponds to a factor analysis of a linear transformation
of $X$ (i.e. $Y = D^{-1/2}(X - \mu)$). The goal is to try to find the loadings $\mathcal{Q}_Y$ and the
specific variance $\Psi_Y$ such that

\[
P = \mathcal{Q}_Y \mathcal{Q}_Y^{\top} + \Psi_Y. \tag{12.9}
\]

In this case the interpretation of the factors $F$ immediately follows from (12.8) given
the following correlation matrix:

\[
P_{XF} = P_{YF} = \mathcal{Q}_Y. \tag{12.10}
\]

Because of the scale invariance of the factors, the loadings and the specific variance
of the model, where $X$ is expressed in its original units of measure, are given by

\[
\begin{aligned}
\mathcal{Q}_X &= D^{1/2} \mathcal{Q}_Y, \\
\Psi_X &= D^{1/2} \Psi_Y D^{1/2}.
\end{aligned}
\]
It  should  be  noted  that  although  the  factor  analysis  model  (12.4)  enjoys  the  scale 
invariance  property,  the  actual  estimated  factors  could  be  scale  dependent.  We  will 
come  back  to  this  point  later  when  we  discuss  the  method  of  principal  factors. 


Non-uniqueness  of  Factor  Loadings 

The factor loadings are not unique! Suppose that $\mathcal{G}$ is an orthogonal matrix. Then $X$ in (12.4) can also be written as

$$X = (\mathcal{Q}\mathcal{G})(\mathcal{G}^\top F) + U + \mu.$$

This implies that, if a $k$-factor model of $X$ with factors $F$ and loadings $\mathcal{Q}$ is true, then the $k$-factor model with factors $\mathcal{G}^\top F$ and loadings $\mathcal{Q}\mathcal{G}$ is also true. In practice, we will take advantage of this non-uniqueness. Indeed, referring back to Sect. 2.6 we can conclude that premultiplying a vector $F$ by an orthogonal matrix corresponds to a rotation of the system of axes, the direction of the first new axis being given by the first row of the orthogonal matrix. It will be shown that choosing an appropriate rotation will result in a matrix of loadings $\mathcal{Q}\mathcal{G}$ that will be easier to interpret. We have seen that the loadings provide the correlations between the factors and the original variables; therefore, it makes sense to search for rotations that give factors that are maximally correlated with various groups of variables.
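The invariance of the common part $\mathcal{Q}\mathcal{Q}^\top$ under rotation can be checked numerically. The following is a minimal sketch in Python (standard library only); the loadings, the angle and the helper names are hypothetical and chosen for illustration only, they are not taken from the book's quantlets.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]

def rotation(theta):
    """Orthogonal 2x2 matrix G(theta)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s], [-s, c]]

# Hypothetical loadings Q (p = 3 variables, k = 2 factors), illustration only.
Q = [[0.9, 0.1],
     [0.8, -0.3],
     [0.2, 0.7]]

G = rotation(math.radians(30))
QG = matmul(Q, G)  # rotated loadings Q G

common = matmul(Q, transpose(Q))             # Q Q^T
common_rotated = matmul(QG, transpose(QG))   # (Q G)(Q G)^T

# Largest entrywise difference between the two common parts.
max_diff = max(abs(common[i][j] - common_rotated[i][j])
               for i in range(3) for j in range(3))
print(max_diff)
```

The rotated loadings differ from the original ones, yet both reproduce the same $\mathcal{Q}\mathcal{Q}^\top$ up to floating-point error, so the covariance structure alone cannot distinguish them.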
From a numerical point of view, the non-uniqueness is a drawback. We have to find loadings $\mathcal{Q}$ and specific variances $\Psi$ satisfying the decomposition $\Sigma = \mathcal{Q}\mathcal{Q}^\top + \Psi$, but no straightforward numerical algorithm can solve this problem due to the multiplicity of the solutions. An acceptable technique is to impose some chosen constraints in order to get, in the best case, a unique solution to the decomposition. Then, as suggested above, once we have a solution we will take advantage of the rotations in order to obtain a solution that is easier to interpret.

An obvious question is: what kind of constraints should we impose in order to eliminate the non-uniqueness problem? Usually, we impose additional constraints where

$$\mathcal{Q}^\top\Psi^{-1}\mathcal{Q} \text{ is diagonal} \qquad (12.11)$$

or

$$\mathcal{Q}^\top\mathcal{D}^{-1}\mathcal{Q} \text{ is diagonal.} \qquad (12.12)$$

How many parameters does the model (12.7) have without constraints?

$\mathcal{Q}(p \times k)$ has $p \cdot k$ parameters, and

$\Psi(p \times p)$ has $p$ parameters.

Hence we have to determine $pk + p$ parameters! Conditions (12.11) and (12.12), respectively, introduce $\frac{1}{2}\{k(k - 1)\}$ constraints, since we require the matrices to be diagonal. Therefore, the degrees of freedom of a model with $k$ factors is:

$$\begin{aligned}
d &= (\#\text{ parameters for } \Sigma \text{ unconstrained}) - (\#\text{ parameters for } \Sigma \text{ constrained}) \\
  &= \tfrac{1}{2}p(p + 1) - \left(pk + p - \tfrac{1}{2}k(k - 1)\right) \\
  &= \tfrac{1}{2}(p - k)^2 - \tfrac{1}{2}(p + k).
\end{aligned}$$

If $d < 0$, then the model is undetermined: there are infinitely many solutions to (12.7). This means that the number of parameters of the factorial model is larger than the number of parameters of the original model, or that the number of factors $k$ is "too large" relative to $p$. In some cases $d = 0$: there is a unique solution to the problem (except for rotation). In practice we usually have that $d > 0$: there are more equations than parameters, thus an exact solution does not exist. In this case approximate solutions are used. An approximation of $\Sigma$, for example, is $\mathcal{Q}\mathcal{Q}^\top + \Psi$. The last case is the most interesting since the factorial model has fewer parameters than the original one. Estimation methods are introduced in the next section.

Evaluating the degrees of freedom, $d$, is particularly important, because it already gives an idea of the upper bound on the number of factors we can hope to identify in a factor model. For instance, if $p = 4$, we could not identify a factor model with two factors (this results in $d = -1$ which has infinitely many solutions). With $p = 4$, only a one factor model gives an approximate solution ($d = 2$). When $p = 6$, models with 1 and 2 factors provide approximate solutions and a model with three factors results in a unique solution (up to the rotations) since $d = 0$. A model with four or more factors would not be allowed, but of course, the aim of factor analysis is to find suitable models with a small number of factors, i.e. smaller than $p$. The next two examples give more insights into the notion of degrees of freedom.
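The cases just listed can be tabulated with a few lines of code. This is a minimal sketch in Python; the function names are illustrative assumptions, and the formula is the one for $d$ derived above.

```python
def degrees_of_freedom(p, k):
    """d = 1/2 (p - k)^2 - 1/2 (p + k) for a k-factor model on p variables.

    (p - k)^2 and (p + k) always have the same parity, so d is an integer.
    """
    return ((p - k) ** 2 - (p + k)) // 2

def classify(p, k):
    """Describe the type of solution implied by the sign of d."""
    d = degrees_of_freedom(p, k)
    if d < 0:
        return "undetermined (infinitely many solutions)"
    if d == 0:
        return "unique solution (up to rotation)"
    return "approximate solution only"

for p, k in [(4, 1), (4, 2), (6, 1), (6, 2), (6, 3), (6, 4)]:
    print(p, k, degrees_of_freedom(p, k), classify(p, k))
```

For a given $p$, scanning $k = 1, 2, \dots$ until $d$ becomes negative gives the largest number of factors that can still be identified.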

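Example 12.1 Let $p = 3$ and $k = 1$, then $d = 0$ and

$$\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_{22} & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_{33} \end{pmatrix} = \begin{pmatrix} q_1^2 + \psi_{11} & q_1q_2 & q_1q_3 \\ q_1q_2 & q_2^2 + \psi_{22} & q_2q_3 \\ q_1q_3 & q_2q_3 & q_3^2 + \psi_{33} \end{pmatrix}$$

with $\mathcal{Q} = (q_1, q_2, q_3)^\top$ and $\Psi = \operatorname{diag}(\psi_{11}, \psi_{22}, \psi_{33})$. Note that here the constraint (12.11) is automatically verified since $k = 1$. We have

$$q_1^2 = \frac{\sigma_{12}\sigma_{13}}{\sigma_{23}}; \qquad q_2^2 = \frac{\sigma_{12}\sigma_{23}}{\sigma_{13}}; \qquad q_3^2 = \frac{\sigma_{13}\sigma_{23}}{\sigma_{12}}$$

and

$$\psi_{11} = \sigma_{11} - q_1^2; \qquad \psi_{22} = \sigma_{22} - q_2^2; \qquad \psi_{33} = \sigma_{33} - q_3^2.$$

In this particular case ($k = 1$), the only rotation is defined by $\mathcal{G} = -1$, so the other solution for the loadings is provided by $-\mathcal{Q}$.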

Example 12.2 Suppose now $p = 2$ and $k = 1$, then $d < 0$ and

$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} = \begin{pmatrix} q_1^2 + \psi_{11} & q_1q_2 \\ q_1q_2 & q_2^2 + \psi_{22} \end{pmatrix}.$$

We have infinitely many solutions: for any $\alpha$ ($\rho < \alpha < 1$), a solution is provided by

$$q_1 = \alpha; \qquad q_2 = \rho/\alpha; \qquad \psi_{11} = 1 - \alpha^2; \qquad \psi_{22} = 1 - (\rho/\alpha)^2.$$

The solution in Example 12.1 may be unique (up to a rotation), but it is not necessarily proper in the sense that it may not be interpretable statistically. Exercise 12.5 gives an example where the specific variance $\psi_{11}$ is negative.
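The family of solutions in Example 12.2 can be verified directly. The sketch below (Python) uses a hypothetical correlation $\rho = 0.5$ and three arbitrary values of $\alpha$ in $(\rho, 1)$; both choices are assumptions made only for illustration.

```python
def one_factor_family(rho, alpha):
    """Return (q1, q2, psi11, psi22) for the family in Example 12.2."""
    q1 = alpha
    q2 = rho / alpha
    return q1, q2, 1 - q1 ** 2, 1 - q2 ** 2

rho = 0.5  # hypothetical correlation, chosen for illustration
solutions = [one_factor_family(rho, a) for a in (0.6, 0.75, 0.9)]

for q1, q2, psi11, psi22 in solutions:
    # Off-diagonal entry q1*q2 and the two diagonal entries of Q Q^T + Psi.
    print(q1 * q2, q1 ** 2 + psi11, q2 ** 2 + psi22)
```

Every choice of $\alpha$ reproduces the same $2 \times 2$ matrix with non-negative specific variances, which is exactly what $d < 0$ expresses.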


Even in the case of a unique solution ($d = 0$), the solution may be inconsistent with statistical interpretations.


Summary

- The factor analysis model aims to describe how the original $p$ variables in a data set depend on a small number of latent factors $k < p$, i.e. it assumes that $X = \mathcal{Q}F + U + \mu$. The ($k$-dimensional) random vector $F$ contains the common factors, the ($p$-dimensional) $U$ contains the specific factors and $\mathcal{Q}(p \times k)$ contains the factor loadings.

- It is assumed that $F$ and $U$ are uncorrelated and have zero means, i.e. $F \sim (0, \mathcal{I})$, $U \sim (0, \Psi)$ where $\Psi$ is a diagonal matrix and $\operatorname{Cov}(F, U) = 0$.

- This leads to the covariance structure $\Sigma = \mathcal{Q}\mathcal{Q}^\top + \Psi$.

- The interpretation of the factor $F$ is obtained through the correlation $P_{XF} = \mathcal{D}^{-1/2}\mathcal{Q}$.

- A normalised analysis is obtained by the model $P = \mathcal{Q}\mathcal{Q}^\top + \Psi$. The interpretation of the factors is given directly by the loadings $\mathcal{Q}$: $P_{XF} = \mathcal{Q}$.

- The factor analysis model is scale invariant. The loadings are not unique (only up to multiplication by an orthogonal matrix).

- Whether a model has a unique solution or not is determined by the degrees of freedom $d = \frac{1}{2}(p - k)^2 - \frac{1}{2}(p + k)$.


12.2  Estimation  of  the  Factor  Model 


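In practice, we have to find estimates $\hat{\mathcal{Q}}$ of the loadings $\mathcal{Q}$ and estimates $\hat\Psi$ of the specific variances $\Psi$ such that analogously to (12.7)

$$\mathcal{S} = \hat{\mathcal{Q}}\hat{\mathcal{Q}}^\top + \hat\Psi,$$

where $\mathcal{S}$ denotes the empirical covariance of $\mathcal{X}$. Given an estimate $\hat{\mathcal{Q}}$ of $\mathcal{Q}$, it is natural to set

$$\hat\psi_{jj} = s_{X_jX_j} - \sum_{\ell=1}^{k} \hat q_{j\ell}^2.$$

We have that $\hat h_j^2 = \sum_{\ell=1}^{k} \hat q_{j\ell}^2$ is an estimate for the communality $h_j^2$.

In the ideal case $d = 0$, there is an exact solution. However, $d$ is usually greater than zero, therefore we have to find $\hat{\mathcal{Q}}$ and $\hat\Psi$ such that $\mathcal{S}$ is approximated by $\hat{\mathcal{Q}}\hat{\mathcal{Q}}^\top + \hat\Psi$. As mentioned above, it is often easier to compute the loadings and the specific variances of the standardised model.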
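Define $\mathcal{Y} = \mathcal{H}\mathcal{X}\mathcal{D}^{-1/2}$, the standardisation of the data matrix $\mathcal{X}$, where $\mathcal{D} = \operatorname{diag}(s_{X_1X_1}, \dots, s_{X_pX_p})$ and the centering matrix $\mathcal{H} = \mathcal{I} - n^{-1}1_n1_n^\top$ (recall from Chap. 2 that $\mathcal{S} = \frac{1}{n}\mathcal{X}^\top\mathcal{H}\mathcal{X}$). The estimated factor loading matrix $\hat{\mathcal{Q}}_Y$ and the estimated specific variance $\hat\Psi_Y$ of $\mathcal{Y}$ are

$$\hat{\mathcal{Q}}_Y = \mathcal{D}^{-1/2}\hat{\mathcal{Q}}_X \quad \text{and} \quad \hat\Psi_Y = \mathcal{D}^{-1}\hat\Psi_X.$$

For the correlation matrix $\mathcal{R}$ of $\mathcal{X}$, we have that

$$\mathcal{R} = \hat{\mathcal{Q}}_Y\hat{\mathcal{Q}}_Y^\top + \hat\Psi_Y.$$

The interpretations of the factors are formulated from the analysis of the loadings $\hat{\mathcal{Q}}_Y$.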

Example 12.3 Let us calculate the matrices just defined for the car data given in Table 22.7. This data set consists of the averaged marks (from 1 = low to 6 = high) for 24 car types. Considering the three variables price, security and easy handling, we get the following correlation matrix:

$$\mathcal{R} = \begin{pmatrix} 1 & 0.975 & 0.613 \\ 0.975 & 1 & 0.620 \\ 0.613 & 0.620 & 1 \end{pmatrix}.$$

We will first look for one factor, i.e. $k = 1$. Note that (# parameters of $\Sigma$ unconstrained $-$ # parameters of $\Sigma$ constrained) is equal to $\frac{1}{2}(p - k)^2 - \frac{1}{2}(p + k) = \frac{1}{2}(3 - 1)^2 - \frac{1}{2}(3 + 1) = 0$. This implies that there is an exact solution! The equation

$$\begin{pmatrix} 1 & r_{X_1X_2} & r_{X_1X_3} \\ r_{X_1X_2} & 1 & r_{X_2X_3} \\ r_{X_1X_3} & r_{X_2X_3} & 1 \end{pmatrix} = \mathcal{R} = \begin{pmatrix} \hat q_1^2 + \hat\psi_{11} & \hat q_1\hat q_2 & \hat q_1\hat q_3 \\ \hat q_1\hat q_2 & \hat q_2^2 + \hat\psi_{22} & \hat q_2\hat q_3 \\ \hat q_1\hat q_3 & \hat q_2\hat q_3 & \hat q_3^2 + \hat\psi_{33} \end{pmatrix}$$

yields the communalities $\hat h_i^2 = \hat q_i^2$, where

$$\hat q_1^2 = \frac{r_{X_1X_2}\, r_{X_1X_3}}{r_{X_2X_3}}, \qquad \hat q_2^2 = \frac{r_{X_1X_2}\, r_{X_2X_3}}{r_{X_1X_3}} \qquad \text{and} \qquad \hat q_3^2 = \frac{r_{X_1X_3}\, r_{X_2X_3}}{r_{X_1X_2}}.$$

Combining this with the specific variances $\hat\psi_{11} = 1 - \hat q_1^2$, $\hat\psi_{22} = 1 - \hat q_2^2$ and $\hat\psi_{33} = 1 - \hat q_3^2$, we obtain the following solution

$$\begin{array}{lll}
\hat q_1 = 0.982 & \hat q_2 = 0.993 & \hat q_3 = 0.624 \\
\hat\psi_{11} = 0.035 & \hat\psi_{22} = 0.014 & \hat\psi_{33} = 0.610.
\end{array}$$

Since the first two communalities ($\hat h_i^2 = \hat q_i^2$) are close to one, we can conclude that the first two variables, namely price and security, are explained by the single factor quite well. This factor can be interpreted as a "price+security" factor.
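The exact one-factor solution of Example 12.3 can be recomputed from the three correlations. The following Python sketch assumes positive correlations (so that the square roots exist and the loadings can be taken positive); the function name is an illustrative assumption.

```python
import math

def one_factor_exact(r12, r13, r23):
    """Exact one-factor solution for p = 3 from the three correlations.

    Assumes all three correlations are positive.
    """
    q1 = math.sqrt(r12 * r13 / r23)
    q2 = math.sqrt(r12 * r23 / r13)
    q3 = math.sqrt(r13 * r23 / r12)
    loadings = (q1, q2, q3)
    # Specific variances of the standardised model: psi_jj = 1 - q_j^2.
    specific = tuple(1 - q ** 2 for q in loadings)
    return loadings, specific

loadings, specific = one_factor_exact(0.975, 0.613, 0.620)
print([round(q, 3) for q in loadings])
print([round(psi, 3) for psi in specific])
```

Because the inputs here are the correlations rounded to three decimals, the results agree with the values reported in the example only up to rounding in the last digit.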


The  Maximum  Likelihood  Method 

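Recall from Chap. 6 the log-likelihood function $\ell$ for a data matrix $\mathcal{X}$ of observations of $X \sim N_p(\mu, \Sigma)$:

$$\begin{aligned}
\ell(\mathcal{X}; \mu, \Sigma) &= -\frac{n}{2}\log|2\pi\Sigma| - \frac{1}{2}\sum_{i=1}^{n}(x_i - \mu)\Sigma^{-1}(x_i - \mu)^\top \\
&= -\frac{n}{2}\log|2\pi\Sigma| - \frac{n}{2}\operatorname{tr}(\Sigma^{-1}\mathcal{S}) - \frac{n}{2}(\bar x - \mu)\Sigma^{-1}(\bar x - \mu)^\top.
\end{aligned}$$

Replacing $\mu$ by $\hat\mu = \bar x$, this can be rewritten as

$$\ell(\mathcal{X}; \hat\mu, \Sigma) = -\frac{n}{2}\left\{\log|2\pi\Sigma| + \operatorname{tr}(\Sigma^{-1}\mathcal{S})\right\}.$$

Substituting $\Sigma = \mathcal{Q}\mathcal{Q}^\top + \Psi$, this becomes

$$\ell(\mathcal{X}; \hat\mu, \mathcal{Q}, \Psi) = -\frac{n}{2}\left[\log\{|2\pi(\mathcal{Q}\mathcal{Q}^\top + \Psi)|\} + \operatorname{tr}\{(\mathcal{Q}\mathcal{Q}^\top + \Psi)^{-1}\mathcal{S}\}\right]. \qquad (12.13)$$

Even in the case of a single factor ($k = 1$), these equations are rather complicated and iterative numerical algorithms have to be used [for more details see Mardia, Kent & Bibby, 1979, p. 263]. A practical computation scheme is also given in Supplement 9A of Johnson and Wichern (1998).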


Likelihood  Ratio  Test  for  the  Number  of  Common  Factors 


Using the methodology of Chap. 7, it is easy to test the adequacy of the factor analysis model by comparing the likelihood under the null (factor analysis) and alternative (no constraints on covariance matrix) hypotheses.

Assuming that $\hat{\mathcal{Q}}$ and $\hat\Psi$ are the maximum likelihood estimates corresponding to (12.13), we obtain the following LR test statistic:

$$-2\log\left(\frac{\text{maximised likelihood under } H_0}{\text{maximised likelihood}}\right) = n\log\left(\frac{|\hat{\mathcal{Q}}\hat{\mathcal{Q}}^\top + \hat\Psi|}{|\mathcal{S}|}\right), \qquad (12.14)$$

which asymptotically has the $\chi^2_{\frac{1}{2}\{(p-k)^2 - p - k\}}$ distribution.
The $\chi^2$ approximation can be improved if we replace $n$ by $n - 1 - (2p + 4k + 5)/6$ in (12.14) (Bartlett, 1954). Using Bartlett's correction, we reject the factor analysis model at the $\alpha$ level if

$$\{n - 1 - (2p + 4k + 5)/6\}\log\left(\frac{|\hat{\mathcal{Q}}\hat{\mathcal{Q}}^\top + \hat\Psi|}{|\mathcal{S}|}\right) > \chi^2_{1-\alpha;\{(p-k)^2 - p - k\}/2}, \qquad (12.15)$$

and if the number of observations $n$ is large and the number of common factors $k$ is such that the $\chi^2$ statistic has a positive number of degrees of freedom.
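The left-hand side of (12.15) and the associated degrees of freedom are simple to evaluate once the two determinants are available. The Python sketch below takes the determinants as inputs; the sample size, dimensions and determinant values are hypothetical numbers chosen only to illustrate the arithmetic, and the function names are assumptions.

```python
import math

def lr_degrees_of_freedom(p, k):
    """Degrees of freedom {(p - k)^2 - p - k} / 2 of the chi-square limit."""
    return ((p - k) ** 2 - p - k) // 2

def bartlett_statistic(n, p, k, det_fitted, det_sample):
    """Bartlett-corrected LR statistic, the left-hand side of (12.15).

    det_fitted is |Q Q^T + Psi| at the ML estimates, det_sample is |S|.
    """
    factor = n - 1 - (2 * p + 4 * k + 5) / 6
    return factor * math.log(det_fitted / det_sample)

# Hypothetical inputs, for illustration only.
n, p, k = 100, 6, 2
stat = bartlett_statistic(n, p, k, det_fitted=0.052, det_sample=0.050)
df = lr_degrees_of_freedom(p, k)
print(stat, df)
```

The statistic would then be compared with the $1 - \alpha$ quantile of the $\chi^2$ distribution with `df` degrees of freedom, taken from a table or a statistics library.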


The  Method  of  Principal  Factors 


The method of principal factors concentrates on the decomposition of the correlation matrix $\mathcal{R}$ or the covariance matrix $\mathcal{S}$. For simplicity, only the method for the correlation matrix $\mathcal{R}$ will be discussed. As pointed out in Chap. 11, the spectral decompositions of $\mathcal{R}$ and $\mathcal{S}$ yield different results and therefore, the method of principal factors may result in different estimators. The method can be motivated as follows: Suppose we know the exact $\Psi$, then the constraint (12.12) implies that the columns of $\mathcal{Q}$ are orthogonal since $\mathcal{D} = \mathcal{I}$ and it implies that they are eigenvectors of $\mathcal{Q}\mathcal{Q}^\top = \mathcal{R} - \Psi$. Furthermore, assume that the first $k$ eigenvalues are positive. In this case we could calculate $\mathcal{Q}$ by means of a spectral decomposition of $\mathcal{Q}\mathcal{Q}^\top$ and $k$ would be the number of factors.

The principal factors algorithm is based on good preliminary estimators $\hat h_j^2$ of the communalities $h_j^2$, for $j = 1, \dots, p$. There are two traditional proposals:

- $\hat h_j^2$, defined as the square of the multiple correlation coefficient of $X_j$ with $(X_\ell)$, for $\ell \neq j$, i.e. $\rho^2(V, W\hat\beta)$ with $V = X_j$, $W = (X_\ell)_{\ell \neq j}$ and where $\hat\beta$ is the least squares regression parameter of a regression of $V$ on $W$.

- $\hat h_j^2 = \max_{\ell \neq j} |r_{X_jX_\ell}|$, where $\mathcal{R} = (r_{X_jX_\ell})$ is the correlation matrix of $\mathcal{X}$.


Given $\hat\psi_{jj} = 1 - \hat h_j^2$ we can construct the reduced correlation matrix, $\mathcal{R} - \hat\Psi$. The Spectral Decomposition Theorem says that

$$\mathcal{R} - \hat\Psi = \sum_{\ell=1}^{p} \lambda_\ell \gamma_\ell \gamma_\ell^\top,$$

with eigenvalues $\lambda_1 \ge \dots \ge \lambda_p$. Assume that the first $k$ eigenvalues $\lambda_1, \dots, \lambda_k$ are positive and large compared to the others. Then we can set

$$\hat q_\ell = \sqrt{\lambda_\ell}\, \gamma_\ell, \qquad \ell = 1, \dots, k$$

or

$$\hat{\mathcal{Q}} = \Gamma_1 \Lambda_1^{1/2}$$

with

$$\Gamma_1 = (\gamma_1, \dots, \gamma_k) \quad \text{and} \quad \Lambda_1 = \operatorname{diag}(\lambda_1, \dots, \lambda_k).$$

In the next step set

$$\hat\psi_{jj} = 1 - \sum_{\ell=1}^{k} \hat q_{j\ell}^2, \qquad j = 1, \dots, p.$$

Note that the procedure can be iterated: from $\hat\psi_{jj}$ we can compute a new reduced correlation matrix $\mathcal{R} - \hat\Psi$ following the same procedure. The iteration usually stops when the $\hat\psi_{jj}$ have converged to a stable value.
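The iteration just described can be sketched for the simplest case of a single factor ($k = 1$). The Python code below is an illustration under stated assumptions, not the book's quantlet: it uses power iteration for the leading eigenpair (which presumes that the reduced matrix has positive entries, as for the car correlations of Example 12.3), starts from $\hat h_j^2 = \max_{\ell \neq j} |r_{X_jX_\ell}|$, and runs a fixed number of steps instead of a convergence test.

```python
import math

def leading_eigenpair(a, iterations=500):
    """Power iteration for the largest eigenvalue of a symmetric matrix.

    Assumes the matrix has positive entries, so that the leading
    eigenvalue is positive and dominant.
    """
    p = len(a)
    v = [1.0 / math.sqrt(p)] * p
    lam = 0.0
    for _ in range(iterations):
        w = [sum(a[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
        # Rayleigh quotient v^T a v of the normalised vector.
        lam = sum(v[i] * sum(a[i][j] * v[j] for j in range(p)) for i in range(p))
    return lam, v

def principal_factor_step(r, communalities):
    """One step of the one-factor (k = 1) principal factor method."""
    p = len(r)
    # Reduced correlation matrix R - Psi: communalities replace the unit diagonal.
    reduced = [[communalities[i] if i == j else r[i][j] for j in range(p)]
               for i in range(p)]
    lam, gamma = leading_eigenpair(reduced)
    q = [math.sqrt(lam) * g for g in gamma]   # q = sqrt(lambda_1) * gamma_1
    psi = [1 - x * x for x in q]              # psi_jj = 1 - q_j^2
    return q, psi

R = [[1.0, 0.975, 0.613],
     [0.975, 1.0, 0.620],
     [0.613, 0.620, 1.0]]

# Preliminary communalities: largest absolute off-diagonal correlation per row.
h2 = [max(abs(R[j][l]) for l in range(3) if l != j) for j in range(3)]
for _ in range(50):
    q, psi = principal_factor_step(R, h2)
    h2 = [1 - x for x in psi]
print(q, psi)
```

In this $d = 0$ example the exact solution of Example 12.3 is a fixed point of `principal_factor_step`: starting the step from the exact communalities returns the same loadings.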

Example 12.4 Consider once again the car data given in Table 22.7. From Exercise 11.4 we know that the first PC is mainly influenced by $X_2$ to $X_7$. Moreover, we know that most of the variance is already captured by the first PC. Thus we can conclude that the data are mainly determined by one factor ($k = 1$).

The eigenvalues of $\mathcal{R} - \hat\Psi$ for $\hat\Psi = (\max_{j \neq i} |r_{X_iX_j}|)$ are

$$(4.628, 1.340, 1.201, 1.045, 1.007, 0.993, 0.980, -4.028)^\top.$$

It would suffice to choose only one factor. Nevertheless, we have computed two factors. The result (the factor loadings for two factors) is shown in Fig. 12.1.
Fig. 12.1 Loadings of the evaluated car qualities, factor analysis with $k = 2$. Q MVAfactcarm
We can clearly see a cluster of points to the right, which contains the factor loadings for the variables $X_2$ to $X_7$. This shows, as did the PCA, that these variables are highly dependent and are thus more or less equivalent. The factor loadings for $X_1$ (economy) and $X_8$ (easy handling) are separate, but note the different scales on the horizontal and vertical axes! Although there are two or three sets of variables in the plot, the variance is already explained by the first factor, the "price+security" factor.


The  Principal  Component  Method 

The principal factor method (PFM) involves finding an approximation $\hat\Psi$ of $\Psi$, the matrix of specific variances, and then correcting $\mathcal{R}$, the correlation matrix of $X$, by $\hat\Psi$. The principal component method (PCM) starts with an approximation $\hat{\mathcal{Q}}$ of $\mathcal{Q}$, the factor loadings matrix. The sample covariance matrix is diagonalised, $\mathcal{S} = \Gamma\Lambda\Gamma^\top$. Then the first $k$ eigenvectors are retained to build

$$\hat{\mathcal{Q}} = (\sqrt{\lambda_1}\,\gamma_1, \dots, \sqrt{\lambda_k}\,\gamma_k). \qquad (12.16)$$

The estimated specific variances are provided by the diagonal elements of the matrix $\mathcal{S} - \hat{\mathcal{Q}}\hat{\mathcal{Q}}^\top$,

$$\hat\Psi = \operatorname{diag}(\hat\psi_{11}, \hat\psi_{22}, \dots, \hat\psi_{pp}) \quad \text{with} \quad \hat\psi_{jj} = s_{X_jX_j} - \sum_{\ell=1}^{k} \hat q_{j\ell}^2. \qquad (12.17)$$

By definition, the diagonal elements of $\mathcal{S}$ are equal to the diagonal elements of $\hat{\mathcal{Q}}\hat{\mathcal{Q}}^\top + \hat\Psi$. The off-diagonal elements are not necessarily estimated. How good then is this approximation? Consider the residual matrix

$$\mathcal{S} - (\hat{\mathcal{Q}}\hat{\mathcal{Q}}^\top + \hat\Psi)$$

resulting from the principal component solution. Analytically we have that

$$\sum_{i,j} (\mathcal{S} - \hat{\mathcal{Q}}\hat{\mathcal{Q}}^\top - \hat\Psi)_{ij}^2 \le \lambda_{k+1}^2 + \dots + \lambda_p^2.$$

This implies that a small value of the neglected eigenvalues can result in a small approximation error. A heuristic device for selecting the number of factors is to consider the proportion of the total sample variance due to the $j$th factor. This quantity is in general equal to

(a) $\lambda_j / \sum_{i=1}^{p} s_{ii}$ for a factor analysis of $\mathcal{S}$,

(b) $\lambda_j / p$ for a factor analysis of $\mathcal{R}$.

Example 12.5 This example uses a consumer-preference study from Johnson and Wichern (1998). Customers were asked to rate several attributes of a new product. The responses were tabulated and the following correlation matrix $\mathcal{R}$ was constructed:

Attribute (Variable)
Taste                     1
Good buy for money        2
Flavor                    3
Suitable for snack        4
Provides lots of energy   5

$$\mathcal{R} = \begin{pmatrix}
1.00 & 0.02 & 0.96 & 0.42 & 0.01 \\
0.02 & 1.00 & 0.13 & 0.71 & 0.85 \\
0.96 & 0.13 & 1.00 & 0.50 & 0.11 \\
0.42 & 0.71 & 0.50 & 1.00 & 0.79 \\
0.01 & 0.85 & 0.11 & 0.79 & 1.00
\end{pmatrix}$$

The entries of $\mathcal{R}$ show that variables 1 and 3 (correlation 0.96) and variables 2 and 5 (correlation 0.85) are highly correlated. Variable 4 is more correlated with variables 2 and 5 than with variables 1 and 3. Hence, a model with 2 (or 3) factors seems to be reasonable.

The first two eigenvalues $\lambda_1 = 2.85$ and $\lambda_2 = 1.81$ of $\mathcal{R}$ are the only eigenvalues greater than one. Moreover, $k = 2$ common factors account for a cumulative proportion

$$\frac{\lambda_1 + \lambda_2}{p} = \frac{2.85 + 1.81}{5} = 0.93$$

of the total (standardised) sample variance. Using the PCM, the estimated factor loadings, communalities, and specific variances are calculated from formulas (12.16) and (12.17), and the results are given in Table 12.1.


Table 12.1 Estimated factor loadings, communalities, and specific variances

                                    Estimated factor
                                    loadings             Communalities   Specific variances
Variable                            q̂_1       q̂_2       ĥ_j^2           ψ̂_jj = 1 − ĥ_j^2
-------------------------------------------------------------------------------------------
1. Taste                            0.56       0.82      0.98            0.02
2. Good buy for money               0.78      -0.53      0.88            0.12
3. Flavor                           0.65       0.75      0.98            0.02
4. Suitable for snack               0.94      -0.11      0.89            0.11
5. Provides lots of energy          0.80      -0.54      0.93            0.07
-------------------------------------------------------------------------------------------
Eigenvalues                         2.85       1.81
Cumulative proportion of total
(standardised) sample variance      0.571      0.932
Take a look at:

$$\hat{\mathcal{Q}}\hat{\mathcal{Q}}^\top + \hat\Psi =
\begin{pmatrix}
0.56 & 0.82 \\
0.78 & -0.53 \\
0.65 & 0.75 \\
0.94 & -0.11 \\
0.80 & -0.54
\end{pmatrix}
\begin{pmatrix}
0.56 & 0.78 & 0.65 & 0.94 & 0.80 \\
0.82 & -0.53 & 0.75 & -0.11 & -0.54
\end{pmatrix}
+
\begin{pmatrix}
0.02 & 0 & 0 & 0 & 0 \\
0 & 0.12 & 0 & 0 & 0 \\
0 & 0 & 0.02 & 0 & 0 \\
0 & 0 & 0 & 0.11 & 0 \\
0 & 0 & 0 & 0 & 0.07
\end{pmatrix}
=
\begin{pmatrix}
1.00 & 0.01 & 0.97 & 0.44 & 0.00 \\
0.01 & 1.00 & 0.11 & 0.79 & 0.91 \\
0.97 & 0.11 & 1.00 & 0.53 & 0.11 \\
0.44 & 0.79 & 0.53 & 1.00 & 0.81 \\
0.00 & 0.91 & 0.11 & 0.81 & 1.00
\end{pmatrix}.$$

This nearly reproduces the correlation matrix $\mathcal{R}$. We conclude that the two-factor model provides a good fit of the data. The communalities (0.98, 0.88, 0.98, 0.89, 0.93) indicate that the two factors account for a large percentage of the sample variance of each variable. Due to the nonuniqueness of factor loadings, the interpretation might be enhanced by rotation. This is the topic of the next subsection.
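The fit of the two-factor model in Example 12.5 can be recomputed from the rounded entries of Table 12.1. The Python sketch below is an illustration only: since it starts from loadings rounded to two decimals, the fitted matrix agrees with the one printed above only up to about 0.01 per entry.

```python
# Rounded loadings and specific variances as reported in Table 12.1.
Q = [[0.56, 0.82],
     [0.78, -0.53],
     [0.65, 0.75],
     [0.94, -0.11],
     [0.80, -0.54]]
psi = [0.02, 0.12, 0.02, 0.11, 0.07]

# Correlation matrix R of Example 12.5.
R = [[1.00, 0.02, 0.96, 0.42, 0.01],
     [0.02, 1.00, 0.13, 0.71, 0.85],
     [0.96, 0.13, 1.00, 0.50, 0.11],
     [0.42, 0.71, 0.50, 1.00, 0.79],
     [0.01, 0.85, 0.11, 0.79, 1.00]]

p = len(Q)

# Fitted matrix Q Q^T + Psi (Psi is diagonal).
fitted = [[sum(a * b for a, b in zip(Q[i], Q[j])) + (psi[i] if i == j else 0.0)
           for j in range(p)] for i in range(p)]

# Residual matrix R - (Q Q^T + Psi).
residual = [[R[i][j] - fitted[i][j] for j in range(p)] for i in range(p)]

communalities = [sum(q * q for q in row) for row in Q]
max_abs_residual = max(abs(x) for row in residual for x in row)
cumulative_proportion = (2.85 + 1.81) / p

for row in fitted:
    print([round(x, 2) for x in row])
print([round(h, 2) for h in communalities], round(cumulative_proportion, 3))
```

The largest residuals occur for the pairs (2, 4) and (2, 5), which is where the printed fitted matrix also departs most from $\mathcal{R}$.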


Rotation 

The constraints (12.11) and (12.12) are given as a matter of mathematical convenience (to create unique solutions) and can therefore complicate the problem of interpretation. The interpretation of the loadings would be very simple if the variables could be split into disjoint sets, each being associated with one factor. A well-known analytical algorithm to rotate the loadings is given by the varimax rotation method proposed by Kaiser (1985). In the simplest case of $k = 2$ factors, a rotation matrix $\mathcal{G}$ is given by

$$\mathcal{G}(\theta) = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix},$$

representing a clockwise rotation of the coordinate axes by the angle $\theta$. The corresponding rotation of loadings is calculated via $\hat{\mathcal{Q}}^* = \hat{\mathcal{Q}}\mathcal{G}(\theta)$. The idea of the varimax method is to find the angle $\theta$ that maximises the sum of the variances of the squared loadings $\hat q_{ij}^*$ within each column of $\hat{\mathcal{Q}}^*$. More precisely, defining $\tilde q_{j\ell}^* = \hat q_{j\ell}^* / \hat h_j^*$, the varimax criterion chooses $\theta$ so that

$$\mathcal{V} = \frac{1}{p}\sum_{\ell=1}^{k}\left[\sum_{j=1}^{p}\left(\tilde q_{j\ell}^*\right)^4 - \frac{1}{p}\left\{\sum_{j=1}^{p}\left(\tilde q_{j\ell}^*\right)^2\right\}^2\right]$$

is maximised.

Example 12.6 Let us return to the marketing example of Johnson and Wichern (1998) (Example 12.5). The basic factor loadings given in Table 12.1 of the first factor and a second factor are almost identical, making it difficult to interpret the factors. Applying the varimax rotation we obtain the loadings $\tilde q_1 = (0.02, 0.94, 0.13, 0.84, 0.97)^\top$ and $\tilde q_2 = (0.99, -0.01, 0.98, 0.43, -0.02)^\top$. The high loadings (0.94, 0.84 and 0.97 on the first factor; 0.99 and 0.98 on the second) show that variables 2, 4, 5 define factor 1, a nutritional factor. Variables 1 and 3 define factor 2 which might be referred to as a taste factor.
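For $k = 2$ the varimax angle can be found by a plain grid search. The Python sketch below is an illustration under stated assumptions: it starts from the rounded loadings of Table 12.1, measures the criterion as the sum over columns of the variance of the squared normalised loadings, and searches $[0°, 90°)$ in steps of $0.05°$ (the criterion repeats with period $90°$, since a quarter turn only swaps the columns up to sign). The function names and the grid resolution are choices made here, not the book's algorithm.

```python
import math

def rotate(Q, theta):
    """Return Q G(theta) with G = [[cos, sin], [-sin, cos]] (k = 2)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[a * c - b * s, a * s + b * c] for a, b in Q]

def varimax_criterion(Q):
    """Sum over columns of the variance of the squared normalised loadings."""
    p = len(Q)
    total = 0.0
    for l in range(2):
        # Squared loadings divided by the communality of each row.
        sq = [row[l] ** 2 / (row[0] ** 2 + row[1] ** 2) for row in Q]
        total += sum(x * x for x in sq) / p - (sum(sq) / p) ** 2
    return total

# Rounded unrotated loadings from Table 12.1.
Q = [[0.56, 0.82],
     [0.78, -0.53],
     [0.65, 0.75],
     [0.94, -0.11],
     [0.80, -0.54]]

# Grid search over [0, 90) degrees in steps of 0.05 degrees.
grid = [math.radians(0.05 * i) for i in range(1800)]
theta = max(grid, key=lambda t: varimax_criterion(rotate(Q, t)))
Q_star = rotate(Q, theta)

print(round(math.degrees(theta), 2))
for row in Q_star:
    print([round(x, 2) for x in row])
```

The rotated loadings obtained this way are close to those quoted in Example 12.6; small differences in the second decimal are to be expected because the inputs are rounded.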


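Summary

- In practice, $\mathcal{Q}$ and $\Psi$ have to be estimated from $\mathcal{S} = \hat{\mathcal{Q}}\hat{\mathcal{Q}}^\top + \hat\Psi$. The number of degrees of freedom is $d = \frac{1}{2}(p - k)^2 - \frac{1}{2}(p + k)$.

- If $d = 0$, then there exists an exact solution. In practice, $d$ is usually greater than 0, thus approximations must be considered.

- The maximum-likelihood method assumes a normal distribution for the data. A solution can be found using numerical algorithms.

- The method of principal factors is a two-stage method which calculates $\hat{\mathcal{Q}}$ from the reduced correlation matrix $\mathcal{R} - \tilde\Psi$, where $\tilde\Psi$ is a pre-estimate of $\Psi$. The final estimate of $\Psi$ is found by $\hat\psi_{ii} = 1 - \sum_{j=1}^{k} \hat q_{ij}^2$.

- The PCM is based on an approximation, $\hat{\mathcal{Q}}$, of $\mathcal{Q}$.

- Often a more informative interpretation of the factors can be found by rotating the factors.

- The varimax rotation chooses a rotation $\theta$ that maximises $\mathcal{V} = \frac{1}{p}\sum_{\ell=1}^{k}\left[\sum_{j=1}^{p}\left(\tilde q_{j\ell}^*\right)^4 - \frac{1}{p}\left\{\sum_{j=1}^{p}\left(\tilde q_{j\ell}^*\right)^2\right\}^2\right]$.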
12.3 Factor Scores and Strategies


Up to now strategies have been presented for factor analysis that have concentrated on the estimation of loadings and communalities and on their interpretations. This was a logical step since the factors $F$ were considered to be normalised random sources of information and were explicitly addressed as nonspecific (common factors). The estimated values of the factors, called the factor scores, may also be useful in the interpretation as well as in the diagnostic analysis. To be more precise, the factor scores are estimates of the unobserved random vectors $F_\ell$, $\ell = 1, \dots, k$, for each individual $x_i$, $i = 1, \dots, n$. Johnson and Wichern (1998) describe three methods which in practice yield very similar results. Here, we present the regression method which has the advantage of being the simplest technique and is easy to implement.

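The idea is to consider the joint distribution of $(X - \mu)$ and $F$, and then to proceed with the regression analysis presented in Chap. 5. Under the factor model (12.4), the joint covariance matrix of $(X - \mu)$ and $F$ is:

$$\operatorname{Var}\begin{pmatrix} X - \mu \\ F \end{pmatrix} = \begin{pmatrix} \mathcal{Q}\mathcal{Q}^\top + \Psi & \mathcal{Q} \\ \mathcal{Q}^\top & \mathcal{I}_k \end{pmatrix}. \qquad (12.18)$$

Note that the upper left entry of this matrix equals $\Sigma$ and that the matrix has size $(p + k) \times (p + k)$.

Assuming joint normality, the conditional distribution of $F \mid X$ is multinormal, see Theorem 5.1, with

$$E(F \mid X = x) = \mathcal{Q}^\top\Sigma^{-1}(x - \mu) \qquad (12.19)$$

and using (5.7) the covariance matrix can be calculated:

$$\operatorname{Var}(F \mid X = x) = \mathcal{I}_k - \mathcal{Q}^\top\Sigma^{-1}\mathcal{Q}. \qquad (12.20)$$

In practice, we replace the unknown $\mathcal{Q}$, $\Sigma$ and $\mu$ by corresponding estimators, leading to the estimated individual factor scores:

$$\hat f_i = \hat{\mathcal{Q}}^\top\mathcal{S}^{-1}(x_i - \bar x). \qquad (12.21)$$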


We prefer to use the original sample covariance matrix $\mathcal{S}$ as an estimator of $\Sigma$, instead of the factor analysis approximation $\hat{\mathcal{Q}}\hat{\mathcal{Q}}^\top + \hat\Psi$, in order to be more robust against incorrect determination of the number of factors.

The same rule can be followed when using $\mathcal{R}$ instead of $\mathcal{S}$. Then (12.18) remains valid when standardised variables, i.e. $Z = \mathcal{D}_\Sigma^{-1/2}(X - \mu)$, are considered if $\mathcal{D}_\Sigma = \operatorname{diag}(\sigma_{11}, \dots, \sigma_{pp})$. In this case the factors are given by

$$\hat f_i = \hat{\mathcal{Q}}^\top\mathcal{R}^{-1}(z_i), \qquad (12.22)$$

where $z_i = \mathcal{D}_S^{-1/2}(x_i - \bar x)$, $\hat{\mathcal{Q}}$ is the loading obtained with the matrix $\mathcal{R}$, and $\mathcal{D}_S = \operatorname{diag}(s_{11}, \dots, s_{pp})$.

If the factors are rotated by the orthogonal matrix $\mathcal{G}$, the factor scores have to be rotated accordingly, that is

$$\hat f_i^* = \mathcal{G}^\top\hat f_i. \qquad (12.23)$$


A  practical  example  is  presented  in  Sect.  12.4  using  the  Boston  Housing  data. 
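Before turning to that example, the regression method for standardised data can be sketched in a few lines of Python. The sketch below covers the one-factor case and computes $\hat{\mathcal{Q}}^\top\mathcal{R}^{-1}z$ by solving a linear system instead of inverting $\mathcal{R}$. It reuses the car correlation matrix and the one-factor loadings of Example 12.3; the standardised observation `z` is hypothetical, and the helper names are assumptions made for illustration.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def regression_score(q, r, z):
    """Score q^T R^{-1} z for a one-factor model (k = 1)."""
    weights = solve(r, q)  # R^{-1} q; equals (q^T R^{-1})^T because R is symmetric
    return sum(w * zj for w, zj in zip(weights, z))

R = [[1.0, 0.975, 0.613],
     [0.975, 1.0, 0.620],
     [0.613, 0.620, 1.0]]
q = [0.982, 0.993, 0.624]  # one-factor loadings from Example 12.3

z = [1.0, 0.5, -0.2]  # hypothetical standardised observation, illustration only
score = regression_score(q, R, z)
print(score)
```

With $k > 1$ factors the same system would be solved once per column of $\hat{\mathcal{Q}}$, and rotated scores follow by applying $\mathcal{G}^\top$ to the resulting score vector.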


Practical  Suggestions 

No one method outperforms another in the practical implementation of factor analysis. However, by applying a tâtonnement process, the factor analysis view of the data can be stabilised. This motivates the following procedure.

1.  Fix  a  reasonable  number  of  factors,  say  k  =  2  or  3,  based  on  the  correlation 
structure  of  the  data  and/or  screeplot  of  eigenvalues. 

2.  Perform  several  of  the  presented  methods,  including  rotation.  Compare  the 
loadings,  communalities,  and  factor  scores  from  the  respective  results. 

3.  If  the  results  show  significant  deviations,  check  for  outliers  (based  on  factor 
scores),  and  consider  changing  the  number  of  factors  k. 

For  larger  data  sets,  cross-validation  methods  are  recommended.  Such  methods 
involve  splitting  the  sample  into  a  training  set  and  a  validation  data  set.  On  the 
training  sample  one  estimates  the  factor  model  with  the  desired  methodology  and 
uses  the  obtained  parameters  to  predict  the  factor  scores  for  the  validation  data  set. 
The  predicted  factor  scores  should  be  comparable  to  the  factor  scores  obtained  using 
only  the  validation  data  set.  This  stability  criterion  may  also  involve  the  loadings  and 
communalities. 


Factor  Analysis  Versus  PCA 

Factor analysis and principal component analysis use the same set of mathematical tools (spectral decomposition, projections, ...). One could conclude, at first sight, that they share the same view and strategy and therefore yield very similar results. This is not true. There are substantial differences between these two data analysis techniques that we would like to describe here.

The biggest difference between PCA and factor analysis comes from the model philosophy. Factor analysis imposes a strict structure of a fixed number of common (latent) factors whereas PCA determines $p$ factors in decreasing order of importance. The most important factor in PCA is the one that maximises the projected variance. The most important factor in factor analysis is the one that (after rotation) gives the maximal interpretation. Often this is different from the direction of the first principal component.

From an implementation point of view, PCA is based on a well-defined, unique algorithm (spectral decomposition), whereas fitting a factor analysis model involves a variety of numerical procedures. The non-uniqueness of the factor analysis procedure opens the door for subjective interpretation and therefore yields a spectrum of results. This data analysis philosophy makes factor analysis difficult, especially if the model specification involves cross-validation and a data-driven selection of the number of factors.


12.4  Boston  Housing 

To  illustrate  how  to  implement  factor  analysis  we  will  use  the  Boston  Housing  data 
set  and  the  by  now  well-known  set  of  transformations.  Once  again,  the  variable  X4 
(Charles  River  indicator)  will  be  excluded.  As  before,  standardised  variables  are 
used  and  the  analysis  is  based  on  the  correlation  matrix. 

In  Sect.  12.3,  we  described  a  practical  implementation  of  factor  analysis.  Based 
on  principal  components,  three  factors  were  chosen  and  factor  analysis  was  applied 
using  the  maximum  likelihood  method  (MLM),  the  PFM,  and  the  PCM.  For 
illustration,  the  MLM  will  be  presented  with  and  without  varimax  rotation. 

Table  12.2  gives  the  MLM  factor  loadings  without  rotation  and  Table  12.3  gives 
the  varimax  version  of  this  analysis.  The  corresponding  graphical  representations 
of  the  loadings  are  displayed  in  Figs.  12.2  and  12.3.  We  can  see  that  the  varimax 


Table 12.2 Estimated factor loadings, communalities, and specific variances, MLM Q MVAfacthous

                          Estimated factor loadings      Communalities   Specific variances
                          $\hat{q}_1$   $\hat{q}_2$   $\hat{q}_3$      $\hat{h}_j^2$   $\hat{\psi}_{jj} = 1 - \hat{h}_j^2$
 1  Crime                  0.9295    0.1653    0.1107      0.9036      0.0964
 2  Large lots            -0.5823    0.0379    0.2902      0.4248      0.5752
 3  Nonretail acres        0.8192   -0.0296   -0.1378      0.6909      0.3091
 5  Nitric oxides          0.8789    0.0987   -0.2719      0.8561      0.1439
 6  Rooms                 -0.4447    0.5311   -0.0380      0.4812      0.5188
 7  Prior 1940             0.7837   -0.0149   -0.3554      0.7406      0.2594
 8  Empl. centers         -0.8294   -0.1570    0.4110      0.8816      0.1184
 9  Accessibility          0.7955    0.3062    0.4053      0.8908      0.1092
10  Tax-rate               0.8262    0.1401    0.2906      0.7867      0.2133
11  Pupil/teacher          0.5051   -0.1850    0.1553      0.3135      0.6865
12  African American       0.4701   -0.0227   -0.1627      0.2480      0.7520
13  Lower status           0.7601   -0.5059   -0.0070      0.8337      0.1663
14  Value                 -0.6942    0.5904   -0.1798      0.8628      0.1371
Table 12.3 Estimated factor loadings, communalities, and specific variances, MLM, varimax rotation Q MVAfacthous

                          Estimated factor loadings      Communalities   Specific variances
                          $\hat{q}_1$   $\hat{q}_2$   $\hat{q}_3$      $\hat{h}_j^2$   $\hat{\psi}_{jj} = 1 - \hat{h}_j^2$
 1  Crime                  0.7247   -0.2705   -0.5525      0.9036      0.0964
 2  Large lots            -0.1570    0.2377    0.5858      0.4248      0.5752
 3  Nonretail acres        0.4195   -0.3566   -0.6287      0.6909      0.3091
 5  Nitric oxides          0.4141   -0.2468   -0.7896      0.8561      0.1439
 6  Rooms                 -0.0799    0.6691    0.1644      0.4812      0.5188
 7  Prior 1940             0.2518   -0.2934   -0.7688      0.7406      0.2594
 8  Empl. centers         -0.3164    0.1515    0.8709      0.8816      0.1184
 9  Accessibility          0.8932   -0.1347   -0.2736      0.8908      0.1092
10  Tax-rate               0.7673   -0.2772   -0.3480      0.7867      0.2133
11  Pupil/teacher          0.3405   -0.4065   -0.1800      0.3135      0.6865
12  African American      -0.3917    0.2483    0.1813      0.2480      0.7520
13  Lower status           0.2586   -0.7752   -0.4072      0.8337      0.1663
14  Value                 -0.3043    0.8520    0.2111      0.8630      0.1370
Fig. 12.2 Factor analysis for Boston housing data, MLM Q MVAfacthous

Fig. 12.3 Factor analysis for Boston housing data, MLM after varimax rotation Q MVAfacthous


does not significantly change the interpretation of the factors obtained by the MLM.
Factor 1 can be roughly interpreted as a “quality of life factor” because it is
positively correlated with variables like $X_{11}$ and negatively correlated with $X_8$, both
having low specific variances. The second factor may be interpreted as a “residential
factor”, since it is highly correlated with variables $X_6$ and $X_{12}$. The most striking
difference between the results with and without varimax rotation can be seen by
comparing the lower left corners of Figs. 12.2 and 12.3. There is a clear separation
of the variables in the varimax version of the MLM. Given this arrangement of the
variables in Fig. 12.3, we can interpret factor 3 as an employment factor, since we
observe high correlations with $X_8$ and $X_5$.
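
The link between the loadings and the last two columns of Tables 12.2 and 12.3 can be checked by hand: the communality of a variable is the sum of its squared loadings, and the specific variance is one minus the communality. The following Python sketch recomputes these quantities for a few rows copied from the tables. The function name `communality` is chosen here for illustration and is not part of the MVAfacthous quantlet.

```python
# Unrotated MLM loadings (q1, q2, q3) copied from Table 12.2 for three variables.
loadings = {
    "Crime": (0.9295, 0.1653, 0.1107),
    "Large lots": (-0.5823, 0.0379, 0.2902),
    "Value": (-0.6942, 0.5904, -0.1798),
}

# Varimax-rotated loadings of the variable "Crime", copied from Table 12.3.
crime_varimax = (0.7247, -0.2705, -0.5525)


def communality(q):
    """Sum of squared loadings of one variable over all factors."""
    return sum(v * v for v in q)


for name, q in loadings.items():
    h2 = communality(q)
    print(f"{name:12s} communality = {h2:.4f}, specific variance = {1 - h2:.4f}")

# An orthogonal rotation such as varimax leaves the communality unchanged
# (up to the rounding of the printed loadings).
print(f"Crime, rotated: communality = {communality(crime_varimax):.4f}")
```

Because the printed loadings are rounded to four decimals, the recomputed values should agree with the tabulated communalities only to roughly three decimals.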

We  now  turn  to  the  PCM  and  PFM  analyses.  The  results  are  presented  in 
Tables  12.4  and  12.5  and  in  Figs.  12.4  and  12.5.  We  would  like  to  focus  on  the 
PCM,  because  this  three-factor  model  yields  only  one  specific  variance  (unexplained 
variation)  above  0.5.  Looking  at  Fig.  12.4,  it  turns  out  that  factor  1  remains  a  “quality 
of life factor” which is clearly visible from the clustering of $X_5$, $X_3$, $X_{10}$ and $X_1$
on the right-hand side of the graph, while the variables $X_8$, $X_2$, $X_{14}$, $X_{12}$ and $X_6$
are  on  the  left-hand  side.  Again,  the  second  factor  is  a  “residential  factor”,  clearly 




Table 12.4 Estimated factor loadings, communalities, and specific variances, PCM, varimax rotation Q MVAfacthous

                          Estimated factor loadings      Communalities   Specific variances
                          $\hat{q}_1$   $\hat{q}_2$   $\hat{q}_3$      $\hat{h}_j^2$   $\hat{\psi}_{jj} = 1 - \hat{h}_j^2$
 1  Crime                  0.6034   -0.2456    0.6864      0.8955      0.1045
 2  Large lots            -0.7722    0.2631    0.0270      0.6661      0.3339
 3  Nonretail acres        0.7183   -0.3701    0.3449      0.7719      0.2281
 5  Nitric oxides          0.7936   -0.2043    0.4250      0.8521      0.1479
 6  Rooms                 -0.1601    0.8585    0.0218      0.7632      0.2368
 7  Prior 1940             0.7895   -0.2375    0.2670      0.7510      0.2490
 8  Empl. centers         -0.8562    0.1318   -0.3240      0.8554      0.1446
 9  Accessibility          0.3681   -0.1268    0.8012      0.7935      0.2065
10  Tax-rate               0.3744   -0.2604    0.7825      0.8203      0.1797
11  Pupil/teacher          0.1982   -0.5124    0.3372      0.4155      0.5845
12  African American       0.1647    0.0368   -0.7002      0.5188      0.4812
13  Lower status           0.4141   -0.7564    0.2781      0.8209      0.1791
14  Value                 -0.2111    0.8131   -0.3671      0.8394      0.1606

Table 12.5 Estimated factor loadings, communalities, and specific variances, PFM, varimax rotation Q MVAfacthous

                          Estimated factor loadings      Communalities   Specific variances
                          $\hat{q}_1$   $\hat{q}_2$   $\hat{q}_3$      $\hat{h}_j^2$   $\hat{\psi}_{jj} = 1 - \hat{h}_j^2$
 1  Crime                  0.5477   -0.2558   -0.7387      0.9111      0.0889
 2  Large lots            -0.6148    0.2668    0.1281      0.4655      0.5345
 3  Nonretail acres        0.6523   -0.3761   -0.3996      0.7266      0.2734
 5  Nitric oxides          0.7723   -0.2291   -0.4412      0.8439      0.1561
 6  Rooms                 -0.1732    0.6783    0.1296      0.0699      0.5046
 7  Prior 1940             0.7390   -0.2723   -0.2909      0.7049      0.2951
 8  Empl. centers         -0.8565    0.1485    0.3395      0.8708      0.1292
 9  Accessibility          0.2855   -0.1359   -0.8460      0.8156      0.1844
10  Tax-rate               0.3062   -0.2656   -0.8174      0.8325      0.1675
11  Pupil/teacher          0.2116   -0.3943   -0.3297      0.3090      0.6910
12  African American       0.1994    0.0666    0.4217      0.2433      0.7567
13  Lower status           0.4005   -0.7743   -0.2706      0.8333      0.1667
14  Value                 -0.1885    0.8400    0.3473      0.8611      0.1389
Fig. 12.4 Factor analysis for Boston housing data, PCM after varimax rotation Q MVAfacthous


demonstrated by the location of variables $X_6$, $X_{14}$, $X_{11}$, and $X_{13}$. The interpretation
of  the  third  factor  is  more  difficult  because  all  of  the  loadings  (except  for  Xu)  are 
very  small. 


12.5  Exercises 


Exercise 12.1 In Example 12.4 we have computed $\hat{\mathcal{Q}}$ and $\hat{\Psi}$ using the method of
principal factors. We used a two-step iteration for $\hat{\Psi}$. Perform the third iteration step
and compare the results (i.e. use the given $\hat{\mathcal{Q}}$ as a pre-estimate to find the final $\hat{\Psi}$).

Exercise  12.2  Using  the  bank  data  set,  how  many  factors  can  you  find  with  the 
Method  of  Principal  Factors? 

Exercise  12.3  Repeat  Exercise  12.2  with  the  US  company  data  set! 

Exercise 12.4 Generalise the two-dimensional rotation matrix in Sect. 12.2 to
n-dimensional space.

Fig. 12.5 Factor analysis for Boston housing data, PFM after varimax rotation Q MVAfacthous


Exercise 12.5 Compute the orthogonal factor model for

$$\Sigma = \begin{pmatrix} 1 & 0.9 & 0.7 \\ 0.9 & 1 & 0.4 \\ 0.7 & 0.4 & 1 \end{pmatrix}.$$

[Solution: $\psi_{11} = -0.575$, $q_{11} = 1.255$]

Exercise  12.6  Perform  a  factor  analysis  on  the  type  of  families  in  the  French  food 
data  set.  Rotate  the  resulting  factors  in  a  way  which  provides  the  most  reasonable 
interpretation.  Compare  your  result  with  the  varimax  method. 

Exercise 12.7 Perform a factor analysis on the variables $X_3$ to $X_9$ in the US crime
data  set  ( Table  22.10).  Would  it  make  sense  to  use  all  of  the  variables  for  the 
analysis? 

Exercise 12.8 Analyse the athletic records data set (Table 22.18). Can you recognise
any patterns if you sort the countries according to the estimates of the factor
scores?

Exercise 12.9 Perform a factor analysis on the US health data set (Table 22.16)
and  estimate  the  factor  scores. 

Exercise  12.10  Redo  Exercise  12.9  using  the  US  crime  data  in  Table  22.10. 
Compare  the  estimated  factor  scores  of  the  two  data  sets. 

Exercise  12.11  Analyse  the  vocabulary  data  given  in  Table  22.17. 


Chapter  13 

Cluster  Analysis 


The  next  two  chapters  address  classification  issues  from  two  varying  perspectives. 
When  considering  groups  of  objects  in  a  multivariate  data  set,  two  situations  can 
arise.  Given  a  data  set  containing  measurements  on  individuals,  in  some  cases  we 
want  to  see  if  some  natural  groups  or  classes  of  individuals  exist,  and  in  other 
cases,  we  want  to  classify  the  individuals  according  to  a  set  of  existing  groups. 
Cluster  analysis  develops  tools  and  methods  concerning  the  former  case,  that  is, 
given  a  data  matrix  containing  multivariate  measurements  on  a  large  number  of 
individuals  (or  objects),  the  objective  is  to  build  some  natural  sub-groups  or  clusters 
of  individuals.  This  is  done  by  grouping  individuals  that  are  “similar”  according  to 
some  appropriate  criterion.  Once  the  clusters  are  obtained,  it  is  generally  useful  to 
describe each group using some descriptive tool from Chaps. 1, 10 or 11 to create a
better  understanding  of  the  differences  that  exist  among  the  formulated  groups. 

Cluster  analysis  is  applied  in  many  fields  such  as  the  natural  sciences,  the  medical 
sciences,  economics,  marketing,  etc.  In  marketing,  for  instance,  it  is  useful  to 
build  and  describe  the  different  segments  of  a  market  from  a  survey  on  potential 
consumers.  An  insurance  company,  on  the  other  hand,  might  be  interested  in  the 
distinction  among  classes  of  potential  customers  so  that  it  can  derive  optimal  prices 
for  its  services.  Other  examples  are  provided  below. 

Discriminant analysis presented in Chap. 14 addresses the other issue of classification.
It focuses on situations where the different groups are known a priori.
Decision  rules  are  provided  in  classifying  a  multivariate  observation  into  one  of  the 
known  groups. 

Section  13.1  states  the  problem  of  cluster  analysis  where  the  criterion  chosen  to 
measure  the  similarity  among  objects  clearly  plays  an  important  role.  Section  13.2 


©  Springer- Verlag  Berlin  Heidelberg  2015 

W.K.  Hardle,  L.  Simar,  Applied  Multivariate  Statistical  Analysis, 

DOI  10.1007/978-3-662-45171-7  13 




shows  how  to  precisely  measure  the  proximity  between  objects.  Finally,  Sect.  13.3 
provides  some  algorithms.  We  will  concentrate  on  hierarchical  algorithms  only 
where  the  number  of  clusters  is  not  known  in  advance. 


13.1  The  Problem 

Cluster  analysis  is  a  set  of  tools  for  building  groups  (clusters)  from  multivariate 
data  objects.  The  aim  is  to  construct  groups  with  homogeneous  properties  out  of 
heterogeneous  large  samples.  The  groups  or  clusters  should  be  as  homogeneous  as 
possible  and  the  differences  among  the  various  groups  as  large  as  possible.  Cluster 
analysis  can  be  divided  into  two  fundamental  steps. 

1.  Choice  of  a  proximity  measure: 

One  checks  each  pair  of  observations  (objects)  for  the  similarity  of  their  values. 
A  similarity  (proximity)  measure  is  defined  to  measure  the  “closeness”  of  the 
objects.  The  “closer”  they  are,  the  more  homogeneous  they  are. 

2. Choice of group-building algorithm:

On the basis of the proximity measures the objects are assigned to groups so that
differences between groups become large and observations in a group become as
close as possible.

In  marketing,  for  example,  cluster  analysis  is  used  to  select  test  markets.  Other 
applications  include  the  classification  of  companies  according  to  their  organisational 
structures,  technologies  and  types.  In  psychology,  cluster  analysis  is  used  to  find 
types  of  personalities  on  the  basis  of  questionnaires.  In  archaeology,  it  is  applied 
to  classify  art  objects  in  different  time  periods.  Other  scientific  branches  that  use 
cluster analysis are medicine, sociology, linguistics and biology. In each case a
heterogeneous sample of objects is analysed with the aim to identify homogeneous
sub-groups.


Summary

↪ Cluster analysis is a set of tools for building groups (clusters) from
multivariate data objects.

↪ The methods used are usually divided into two fundamental steps:
the choice of a proximity measure and the choice of a group-building
algorithm.


13.2 The Proximity Between Objects

The starting point of a cluster analysis is a data matrix $\mathcal{X}(n \times p)$ with $n$
measurements (objects) of $p$ variables. The proximity (similarity) among objects
is described by a matrix $\mathcal{D}(n \times n)$

$$\mathcal{D} = \begin{pmatrix}
d_{11} & d_{12} & \cdots & d_{1n} \\
\vdots & d_{22} & & \vdots \\
\vdots & \vdots & \ddots & \vdots \\
d_{n1} & d_{n2} & \cdots & d_{nn}
\end{pmatrix}. \tag{13.1}$$


The matrix $\mathcal{D}$ contains measures of similarity or dissimilarity among the $n$ objects. If
the values $d_{ij}$ are distances, then they measure dissimilarity. The greater the distance,
the less similar are the objects. If the values $d_{ij}$ are proximity measures, then the
opposite is true, i.e. the greater the proximity value, the more similar are the objects.
A distance matrix, for example, could be defined by the $L_2$-norm: $d_{ij} = \|x_i - x_j\|_2$,
where $x_i$ and $x_j$ denote the rows of the data matrix $\mathcal{X}$. Distance and similarity are of
course dual. If $d_{ij}$ is a distance, then $d'_{ij} = \max_{i,j}\{d_{ij}\} - d_{ij}$ is a proximity measure.
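
As a rough illustration of this duality, the following Python sketch builds an $L_2$ distance matrix from the rows of a small data matrix and converts it into a proximity matrix by subtracting each distance from the largest one. The data and the function names are hypothetical and only serve the illustration.

```python
import math


def l2_distance_matrix(rows):
    """Return the n x n matrix of Euclidean (L2) distances between the rows."""
    n = len(rows)
    return [
        [math.sqrt(sum((a - b) ** 2 for a, b in zip(rows[i], rows[j])))
         for j in range(n)]
        for i in range(n)
    ]


def distance_to_proximity(dist):
    """Turn distances into proximities: d'_ij = max_ij{d_ij} - d_ij."""
    d_max = max(max(row) for row in dist)
    return [[d_max - d for d in row] for row in dist]


# Hypothetical data matrix with n = 3 objects and p = 2 variables.
X = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = l2_distance_matrix(X)        # large value = dissimilar objects
P = distance_to_proximity(D)     # large value = similar objects

for row_d, row_p in zip(D, P):
    print(row_d, row_p)
```

In the proximity matrix the diagonal carries the largest values, since every object is closest to itself, while the most distant pair of objects receives proximity zero.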

The  nature  of  the  observations  plays  an  important  role  in  the  choice  of  proximity 
measure.  Nominal  values  (like  binary  variables)  lead  in  general  to  proximity  values, 
whereas  metric  values  lead  (in  general)  to  distance  matrices.  We  first  present 
possibilities for $\mathcal{D}$ in the binary case and then consider the continuous case.


Similarity  of  Objects  with  Binary  Structure 

In order to measure the similarity between objects we always compare pairs of
observations $(x_i, x_j)$ where $x_i^\top = (x_{i1}, \dots, x_{ip})$, $x_j^\top = (x_{j1}, \dots, x_{jp})$, and $x_{ik}, x_{jk} \in \{0, 1\}$.
Obviously there are four cases:

$$x_{ik} = x_{jk} = 1,$$
$$x_{ik} = 0, \; x_{jk} = 1,$$
$$x_{ik} = 1, \; x_{jk} = 0,$$
$$x_{ik} = x_{jk} = 0.$$
Define

$$a_1 = \sum_{k=1}^{p} I(x_{ik} = x_{jk} = 1),$$
$$a_2 = \sum_{k=1}^{p} I(x_{ik} = 0, \; x_{jk} = 1),$$
$$a_3 = \sum_{k=1}^{p} I(x_{ik} = 1, \; x_{jk} = 0),$$
$$a_4 = \sum_{k=1}^{p} I(x_{ik} = x_{jk} = 0).$$


Note that each $a_l$, $l = 1, \dots, 4$, depends on the pair $(x_i, x_j)$.
The following proximity measures are used in practice:

$$d_{ij} = \frac{a_1 + \delta a_4}{a_1 + \delta a_4 + \lambda (a_2 + a_3)} \tag{13.2}$$

where $\delta$ and $\lambda$ are weighting factors. Table 13.1 shows some similarity measures for
given weighting factors.

These  measures  provide  alternative  ways  of  weighting  mismatching  and  positive 
(presence  of  a  common  character)  or  negative  (absence  of  a  common  character) 
matchings.  In  principle,  we  could  also  consider  the  Euclidean  distance.  However, 
the  disadvantage  of  this  distance  is  that  it  treats  the  observations  0  and  1  in  the  same 
way. If $x_{ik} = 1$ denotes, say, knowledge of a certain language, then the contrary,
$x_{ik} = 0$ (not knowing the language) should eventually be treated differently.
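
The counts $a_1, \dots, a_4$ and the general coefficient (13.2) are easy to compute directly. The following Python sketch does so for two hypothetical binary vectors and evaluates the Jaccard, Tanimoto, Simple Matching and Dice coefficients through their weighting factors from Table 13.1. The vectors and function names are assumptions made for this illustration.

```python
def match_counts(xi, xj):
    """Return (a1, a2, a3, a4) for two binary vectors of equal length."""
    a1 = sum(1 for u, v in zip(xi, xj) if u == 1 and v == 1)  # both present
    a2 = sum(1 for u, v in zip(xi, xj) if u == 0 and v == 1)  # only in x_j
    a3 = sum(1 for u, v in zip(xi, xj) if u == 1 and v == 0)  # only in x_i
    a4 = sum(1 for u, v in zip(xi, xj) if u == 0 and v == 0)  # both absent
    return a1, a2, a3, a4


def similarity(xi, xj, delta, lam):
    """General coefficient (13.2) with weighting factors delta and lambda.

    The denominator is zero when no term contributes (e.g. Jaccard on two
    all-zero vectors); that case is not handled in this sketch.
    """
    a1, a2, a3, a4 = match_counts(xi, xj)
    return (a1 + delta * a4) / (a1 + delta * a4 + lam * (a2 + a3))


# Two hypothetical objects described by p = 8 binary variables.
xi = [1, 1, 0, 0, 1, 0, 1, 0]
xj = [1, 0, 0, 1, 1, 0, 0, 0]

jaccard = similarity(xi, xj, delta=0, lam=1)
tanimoto = similarity(xi, xj, delta=1, lam=2)
simple_matching = similarity(xi, xj, delta=1, lam=1)
dice = similarity(xi, xj, delta=0, lam=0.5)
print(match_counts(xi, xj), jaccard, tanimoto, simple_matching, dice)
```

Setting $\delta = 0$ ignores the joint absences $a_4$, which is the natural choice when, as in the language example, a shared 0 carries little information.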

Example  13.1  Let  us  consider  binary  variables  computed  from  the  car  data  set 
(Table  22.7).  We  define  the  new  binary  data  by 


$$y_{ik} = \begin{cases} 1 & \text{if } x_{ik} > \bar{x}_k, \\ 0 & \text{otherwise,} \end{cases}$$

Table 13.1 The common similarity coefficients

Name                    $\delta$    $\lambda$    Definition
Jaccard                 0      1      $\frac{a_1}{a_1 + a_2 + a_3}$
Tanimoto                1      2      $\frac{a_1 + a_4}{a_1 + 2(a_2 + a_3) + a_4}$
Simple matching (M)     1      1      $\frac{a_1 + a_4}{p}$
Russel and Rao (RR)     -      -      $\frac{a_1}{p}$
Dice                    0      0.5    $\frac{2a_1}{2a_1 + (a_2 + a_3)}$
Kulczynski              -      -      $\frac{a_1}{a_2 + a_3}$
for $i = 1, \dots, n$ and $k = 1, \dots, p$. This means that we transform the observations
of the $k$-th variable to 1 if it is larger than the mean value of all observations of the
$k$-th variable. Let us only consider the data points 17 to 19 (Renault 19, Rover and
Toyota Corolla) which lead to $(3 \times 3)$ distance matrices. The Jaccard measure gives
the similarity matrix

$$\mathcal{D} = \begin{pmatrix} 1.000 & 0.000 & 0.400 \\ & 1.000 & 0.167 \\ & & 1.000 \end{pmatrix},$$

the Tanimoto measure yields

$$\mathcal{D} = \begin{pmatrix} 1.000 & 0.000 & 0.455 \\ & 1.000 & 0.231 \\ & & 1.000 \end{pmatrix},$$

whereas the Simple Matching measure gives

$$\mathcal{D} = \begin{pmatrix} 1.000 & 0.000 & 0.625 \\ & 1.000 & 0.375 \\ & & 1.000 \end{pmatrix}.$$


Distance  Measures  for  Continuous  Variables 

A wide variety of distance measures can be generated by the $L_r$-norms, $r \ge 1$,

$$d_{ij} = \|x_i - x_j\|_r = \left\{ \sum_{k=1}^{p} |x_{ik} - x_{jk}|^r \right\}^{1/r}. \tag{13.3}$$

Here $x_{ik}$ denotes the value of the $k$-th variable on object $i$. It is clear that $d_{ii} = 0$ for
$i = 1, \dots, n$. The class of distances (13.3) for varying $r$ measures the dissimilarity
with different weights. The $L_1$-metric, for example, gives less weight to outliers than
the $L_2$-norm (Euclidean norm). It is common to consider the squared $L_2$-norm.

Example 13.2 Suppose we have $x_1 = (0, 0)$, $x_2 = (1, 0)$ and $x_3 = (5, 5)$. Then the
distance matrix for the $L_1$-norm is

$$\mathcal{D}_1 = \begin{pmatrix} 0 & 1 & 10 \\ 1 & 0 & 9 \\ 10 & 9 & 0 \end{pmatrix},$$

and for the squared $L_2$- or Euclidean norm

$$\mathcal{D}_2 = \begin{pmatrix} 0 & 1 & 50 \\ 1 & 0 & 41 \\ 50 & 41 & 0 \end{pmatrix}.$$

One can see that the third observation $x_3$ receives much more weight in the squared
$L_2$-norm than in the $L_1$-norm.
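
The two matrices of this example can be reproduced in a few lines. The following Python sketch computes the $L_1$ distances and the squared $L_2$ distances for the three points. The function names are illustrative and not taken from any quantlet.

```python
def l1_distance(xi, xj):
    """L1 (city-block) distance between two points."""
    return sum(abs(a - b) for a, b in zip(xi, xj))


def squared_l2_distance(xi, xj):
    """Squared L2 (Euclidean) distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(xi, xj))


def distance_matrix(points, dist):
    """Evaluate a pairwise distance function on all pairs of points."""
    return [[dist(a, b) for b in points] for a in points]


points = [(0, 0), (1, 0), (5, 5)]
D1 = distance_matrix(points, l1_distance)
D2 = distance_matrix(points, squared_l2_distance)

for row in D1:
    print(row)
for row in D2:
    print(row)
```

Squaring the coordinate differences is what inflates the entries involving the remote third point, from 10 and 9 to 50 and 41.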

An  underlying  assumption  in  applying  distances  based  on  Lr -norms  is  that  the 
variables  are  measured  on  the  same  scale.  If  this  is  not  the  case,  a  standardisation 
should  first  be  applied.  This  corresponds  to  using  a  more  general  L2-  or  Euclidean 
norm  with  a  metric  A,  where  A  >  0  (see  Sect.  2.6): 


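$$d_{ij}^2 = \|x_i - x_j\|_A = (x_i - x_j)^\top A (x_i - x_j). \tag{13.4}$$

$L_2$-norms are given by $A = \mathcal{I}_p$, but if a standardisation is desired, then the
weight matrix $A = \operatorname{diag}(s_{X_1X_1}^{-1}, \dots, s_{X_pX_p}^{-1})$ may be suitable. Recall that $s_{X_kX_k}$ is
the variance of the $k$-th component. Hence we have

$$d_{ij}^2 = \sum_{k=1}^{p} \frac{(x_{ik} - x_{jk})^2}{s_{X_kX_k}}. \tag{13.5}$$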


Here  each  component  has  the  same  weight  in  the  computation  of  the  distances  and 
the  distances  do  not  depend  on  a  particular  choice  of  the  units  of  measure. 
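
The invariance with respect to the units of measure can be checked numerically. The following Python sketch evaluates (13.5) on a small hypothetical data set and again after the second variable has been rescaled by a factor of 1000. The variance is computed with divisor $n$, which is an assumption of this sketch; the invariance holds for the divisor $n - 1$ as well.

```python
def column_variances(rows):
    """Empirical variance of every column (divisor n, an assumption here)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(p)]
    return [sum((r[k] - means[k]) ** 2 for r in rows) / n for k in range(p)]


def standardised_sq_distance(xi, xj, variances):
    """Squared distance (13.5): each squared difference divided by s_XkXk."""
    return sum((a - b) ** 2 / s for a, b, s in zip(xi, xj, variances))


# Hypothetical data: the second variable is recorded in two different units.
X = [[1.0, 200.0], [2.0, 180.0], [4.0, 260.0]]
X_rescaled = [[a, b / 1000.0] for a, b in X]

d = standardised_sq_distance(X[0], X[1], column_variances(X))
d_rescaled = standardised_sq_distance(
    X_rescaled[0], X_rescaled[1], column_variances(X_rescaled)
)
print(d, d_rescaled)  # equal up to floating-point rounding
```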

Example  13.3  Consider  the  French  Food  expenditures  (Table  22.6).  The  Euclidean 
distance  matrix  (squared  L2-norm)  is 


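$$\begin{pmatrix}
0.00 & 5.82 & 58.19 & 3.54 & 5.15 & 151.44 & 16.91 & 36.15 & 147.99 & 51.84 & 102.56 & 271.83 \\
 & 0.00 & 41.73 & 4.53 & 2.93 & 120.59 & 13.52 & 25.39 & 116.31 & 43.68 & 76.81 & 226.87 \\
 & & 0.00 & 44.14 & 40.10 & 24.12 & 29.95 & 8.17 & 25.57 & 20.81 & 20.30 & 88.62 \\
 & & & 0.00 & 0.76 & 127.85 & 5.62 & 21.70 & 124.98 & 31.21 & 72.97 & 231.57 \\
 & & & & 0.00 & 121.05 & 5.70 & 19.85 & 118.77 & 30.82 & 67.39 & 220.72 \\
 & & & & & 0.00 & 96.57 & 48.16 & 1.80 & 60.52 & 28.90 & 29.56 \\
 & & & & & & 0.00 & 9.20 & 94.87 & 11.07 & 42.12 & 179.84 \\
 & & & & & & & 0.00 & 46.95 & 6.17 & 18.76 & 113.03 \\
 & & & & & & & & 0.00 & 61.08 & 29.62 & 31.86 \\
 & & & & & & & & & 0.00 & 15.83 & 116.11 \\
 & & & & & & & & & & 0.00 & 53.77 \\
 & & & & & & & & & & & 0.00
\end{pmatrix}$$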
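Taking the weight matrix $A = \operatorname{diag}(s_{X_1X_1}^{-1}, \dots, s_{X_7X_7}^{-1})$, we obtain the distance matrix
(squared $L_2$-norm)

$$\begin{pmatrix}
0.00 & 6.85 & 10.04 & 1.68 & 2.66 & 24.90 & 8.28 & 8.56 & 24.61 & 21.55 & 30.68 & 57.48 \\
 & 0.00 & 13.11 & 6.59 & 3.75 & 20.12 & 13.13 & 12.38 & 15.88 & 31.52 & 25.65 & 46.64 \\
 & & 0.00 & 8.03 & 7.27 & 4.99 & 9.27 & 3.88 & 7.46 & 14.92 & 15.08 & 26.89 \\
 & & & 0.00 & 0.64 & 20.06 & 2.76 & 3.82 & 19.63 & 12.81 & 19.28 & 45.01 \\
 & & & & 0.00 & 17.00 & 3.54 & 3.81 & 15.76 & 14.98 & 16.89 & 39.87 \\
 & & & & & 0.00 & 17.51 & 9.79 & 1.58 & 21.32 & 11.36 & 13.40 \\
 & & & & & & 0.00 & 1.80 & 17.92 & 4.39 & 9.93 & 33.61 \\
 & & & & & & & 0.00 & 10.50 & 5.70 & 7.97 & 24.41 \\
 & & & & & & & & 0.00 & 24.75 & 11.02 & 13.07 \\
 & & & & & & & & & 0.00 & 9.13 & 29.78 \\
 & & & & & & & & & & 0.00 & 9.39 \\
 & & & & & & & & & & & 0.00
\end{pmatrix} \tag{13.6}$$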

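When applied to contingency tables, a $\chi^2$-metric is suitable to compare (and
cluster) rows and columns of a contingency table.

If $\mathcal{X}$ is a contingency table, row $i$ is characterised by the conditional frequency
distribution $\frac{x_{ij}}{x_{i\bullet}}$, where $x_{i\bullet} = \sum_{j=1}^{p} x_{ij}$ indicates the marginal distributions over
the rows: $\frac{x_{i\bullet}}{x_{\bullet\bullet}}$, $x_{\bullet\bullet} = \sum_{i=1}^{n} x_{i\bullet}$. Similarly, column $j$ of $\mathcal{X}$ is characterised by the
conditional frequencies $\frac{x_{ij}}{x_{\bullet j}}$, where $x_{\bullet j} = \sum_{i=1}^{n} x_{ij}$. The marginal frequencies of
the columns are $\frac{x_{\bullet j}}{x_{\bullet\bullet}}$.

The distance between two rows, $i_1$ and $i_2$, corresponds to the distance between
their respective frequency distributions. It is common to define this distance using
the $\chi^2$-metric:

$$d^2(i_1, i_2) = \sum_{j=1}^{p} \frac{1}{\left( \frac{x_{\bullet j}}{x_{\bullet\bullet}} \right)} \left( \frac{x_{i_1 j}}{x_{i_1 \bullet}} - \frac{x_{i_2 j}}{x_{i_2 \bullet}} \right)^2. \tag{13.7}$$

Note that this can be expressed as a distance between the vectors $x_1 = \left( \frac{x_{i_1 j}}{x_{i_1 \bullet}} \right)$ and
$x_2 = \left( \frac{x_{i_2 j}}{x_{i_2 \bullet}} \right)$ as in (13.4) with weighting matrix $A = \left\{ \operatorname{diag}\left( \frac{x_{\bullet j}}{x_{\bullet\bullet}} \right) \right\}^{-1}$. Similarly, if
we are interested in clusters among the columns, we can define:

$$d^2(j_1, j_2) = \sum_{i=1}^{n} \frac{1}{\left( \frac{x_{i \bullet}}{x_{\bullet\bullet}} \right)} \left( \frac{x_{i j_1}}{x_{\bullet j_1}} - \frac{x_{i j_2}}{x_{\bullet j_2}} \right)^2.$$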
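
To make the $\chi^2$-metric between rows concrete, the following Python sketch evaluates it for a small hypothetical contingency table. The table and the function name are assumptions made for illustration only.

```python
def chi2_row_distance(table, i1, i2):
    """Squared chi-square distance between rows i1 and i2 of a contingency table.

    Each row is first turned into its conditional frequency distribution;
    the squared differences are weighted by the inverse of the marginal
    column frequencies x_.j / x_..
    """
    p = len(table[0])
    total = sum(sum(row) for row in table)                    # x_..
    col_totals = [sum(row[j] for row in table) for j in range(p)]  # x_.j
    r1, r2 = sum(table[i1]), sum(table[i2])                   # row totals
    return sum(
        (table[i1][j] / r1 - table[i2][j] / r2) ** 2 / (col_totals[j] / total)
        for j in range(p)
    )


# Hypothetical 3 x 3 contingency table; rows 0 and 1 have the same profile.
table = [
    [10, 20, 30],
    [20, 40, 60],
    [30, 10, 20],
]

print(chi2_row_distance(table, 0, 1))  # identical profiles: distance zero
print(chi2_row_distance(table, 0, 2))  # different profiles: positive distance
```

Rows with proportional counts have identical conditional distributions and therefore distance zero, whatever their totals.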
Apart from the Euclidean and the $L_r$-norm measures one can use a proximity
measure such as the Q-correlation coefficient

$$d_{ij} = \frac{\sum_{k=1}^{p} (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j)}{\left\{ \sum_{k=1}^{p} (x_{ik} - \bar{x}_i)^2 \sum_{k=1}^{p} (x_{jk} - \bar{x}_j)^2 \right\}^{1/2}}. \tag{13.8}$$

Here $\bar{x}_i$ denotes the mean over the variables $(x_{i1}, \dots, x_{ip})$.
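
The Q-correlation coefficient is a correlation computed across the variables of two objects rather than across objects. A minimal Python sketch, with hypothetical data and an illustrative function name, is:

```python
import math


def q_correlation(xi, xj):
    """Q-correlation (13.8) between two objects measured on p variables."""
    p = len(xi)
    mean_i = sum(xi) / p  # mean over the variables of object i
    mean_j = sum(xj) / p  # mean over the variables of object j
    num = sum((a - mean_i) * (b - mean_j) for a, b in zip(xi, xj))
    den = math.sqrt(
        sum((a - mean_i) ** 2 for a in xi) * sum((b - mean_j) ** 2 for b in xj)
    )
    return num / den  # undefined if one object is constant over its variables


# Hypothetical objects: the second has the same profile shape as the first,
# the third has the reversed shape.
x1 = [1.0, 2.0, 3.0]
x2 = [2.0, 4.0, 6.0]
x3 = [3.0, 2.0, 1.0]

print(q_correlation(x1, x2), q_correlation(x1, x3))
```

Since it is a proximity measure, values close to 1 indicate similar profiles even when the objects differ strongly in level or scale.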


Summary

↪ The proximity between data points is measured by a distance
or similarity matrix $\mathcal{D}$ whose components $d_{ij}$ give the similarity
coefficient or the distance between two points $x_i$ and $x_j$.

↪ A variety of similarity (distance) measures exist for binary data
(e.g. Jaccard, Tanimoto, Simple Matching coefficients) and for
continuous data (e.g. $L_r$-norms).

↪ The nature of the data could impose the choice of a particular
metric $A$ in defining the distances (standardisation, $\chi^2$-metric etc.).


13.3  Cluster  Algorithms 

There  are  essentially  two  types  of  clustering  methods:  hierarchical  algorithms  and 
partitioning algorithms. The hierarchical algorithms can be divided into agglomerative
and splitting procedures. The first type of hierarchical clustering starts from
the  finest  partition  possible  (each  observation  forms  a  cluster)  and  groups  them.  The 
second  type  starts  with  the  coarsest  partition  possible:  one  cluster  contains  all  of  the 
observations.  It  proceeds  by  splitting  the  single  cluster  up  into  smaller  sized  clusters. 

The  partitioning  algorithms  start  from  a  given  group  definition  and  proceed  by 
exchanging  elements  between  groups  until  a  certain  score  is  optimised.  The  main 
difference  between  the  two  clustering  techniques  is  that  in  hierarchical  clustering 
once  groups  are  found  and  elements  are  assigned  to  the  groups,  this  assignment 
cannot  be  changed.  In  partitioning  techniques,  on  the  other  hand,  the  assignment  of 
objects  into  groups  may  change  during  the  algorithm  application. 
Hierarchical Algorithms, Agglomerative Techniques

Agglomerative  algorithms  are  used  quite  frequently  in  practice.  The  algorithm 
consists  of  the  following  steps: 


Algorithm Hierarchical algorithms, agglomerative technique

1: Construct the finest partition
2: Compute the distance matrix $\mathcal{D}$
3: repeat
4:   Find the two clusters with the closest distance
5:   Put those two clusters into one cluster
6:   Compute the distance between the new groups and obtain a reduced distance matrix $\mathcal{D}$
7: until all clusters are agglomerated into $\mathcal{X}$


If two objects or groups say, $P$ and $Q$, are united, one computes the distance
between this new group (object) $P + Q$ and group $R$ using the following distance
function:

$$d(R, P + Q) = \delta_1 d(R, P) + \delta_2 d(R, Q) + \delta_3 d(P, Q) + \delta_4 |d(R, P) - d(R, Q)|. \tag{13.9}$$

The $\delta_j$'s are weighting factors that lead to different agglomerative algorithms as
described in Table 13.2. Here $n_P = \sum_{i=1}^{n} I(x_i \in P)$ is the number of objects in
group $P$. The values of $n_Q$ and $n_R$ are defined analogously.
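
The update formula (13.9) can be written as a single function once the weighting factors are available. The following Python sketch encodes the weights of Table 13.2 and applies the update to the squared Euclidean distances $d(R, P) = 50$, $d(R, Q) = 41$ and $d(P, Q) = 1$ of the points $x_1 = (0, 0)$, $x_2 = (1, 0)$ and $x_3 = (5, 5)$. The function names and the method labels are chosen for this illustration.

```python
def linkage_weights(method, n_p=1, n_q=1, n_r=1):
    """Weights (delta_1, delta_2, delta_3, delta_4) as listed in Table 13.2."""
    total = n_p + n_q
    if method == "single":
        return 0.5, 0.5, 0.0, -0.5
    if method == "complete":
        return 0.5, 0.5, 0.0, 0.5
    if method == "average_unweighted":
        return 0.5, 0.5, 0.0, 0.0
    if method == "average_weighted":
        return n_p / total, n_q / total, 0.0, 0.0
    if method == "centroid":
        return n_p / total, n_q / total, -n_p * n_q / total ** 2, 0.0
    if method == "median":
        return 0.5, 0.5, -0.25, 0.0
    if method == "ward":
        s = n_r + n_p + n_q
        return (n_r + n_p) / s, (n_r + n_q) / s, -n_r / s, 0.0
    raise ValueError(f"unknown method: {method}")


def updated_distance(d_rp, d_rq, d_pq, weights):
    """Distance d(R, P + Q) between R and the merged group, formula (13.9)."""
    d1, d2, d3, d4 = weights
    return d1 * d_rp + d2 * d_rq + d3 * d_pq + d4 * abs(d_rp - d_rq)


# Squared Euclidean distances between the three singleton groups.
d_rp, d_rq, d_pq = 50, 41, 1
for method in ("single", "complete", "centroid"):
    w = linkage_weights(method, n_p=1, n_q=1, n_r=1)
    print(method, updated_distance(d_rp, d_rq, d_pq, w))
```

With the single linkage weights the update reduces to the minimum of the two distances, with the complete linkage weights to the maximum. The centroid value equals the squared distance between $x_3$ and the midpoint $(0.5, 0)$ of $x_1$ and $x_2$.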

For the most commonly used single and complete linkages, the agglomerative
algorithm can be modified as given below.

In the modified algorithm a linear search in the original distance matrix is enough
for clustering, so no new distance matrix has to be computed at every step. It is
therefore more efficient in practice.


Table 13.2 Computations of group distances

Name                           $\delta_1$                               $\delta_2$                               $\delta_3$                           $\delta_4$
Single linkage                 1/2                              1/2                              0                            -1/2
Complete linkage               1/2                              1/2                              0                            1/2
Average linkage (unweighted)   1/2                              1/2                              0                            0
Average linkage (weighted)     $\frac{n_P}{n_P + n_Q}$                  $\frac{n_Q}{n_P + n_Q}$                  0                            0
Centroid                       $\frac{n_P}{n_P + n_Q}$                  $\frac{n_Q}{n_P + n_Q}$                  $-\frac{n_P n_Q}{(n_P + n_Q)^2}$        0
Median                         1/2                              1/2                              -1/4                         0
Ward                           $\frac{n_R + n_P}{n_R + n_P + n_Q}$      $\frac{n_R + n_Q}{n_R + n_P + n_Q}$      $-\frac{n_R}{n_R + n_P + n_Q}$          0
Algorithm Modified hierarchical algorithms, agglomerative technique

1: Construct the finest partition
2: Compute the distance matrix $\mathcal{D}$
3: repeat
4:   Find the smallest (Single linkage) / largest (Complete linkage) value $d$ (between objects $m$ and $n$) in $\mathcal{D}$
5:   If $m$ and $n$ are not in the same cluster, combine the clusters to which $m$ and $n$ belong, and delete the smallest value
6: until all clusters are agglomerated into $\mathcal{X}$ or the value $d$ exceeds the preset level
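
For the single linkage case, the modified algorithm can be sketched in Python as follows. Instead of repeatedly searching the matrix for the smallest remaining value, the sketch sorts all pairwise distances once and then scans them in increasing order, which visits the values in the same order. Cluster membership is tracked with a simple union-find structure. Both choices, as well as the function name, are implementation assumptions of this sketch and not prescribed by the algorithm above.

```python
def modified_single_linkage(dist, level=None):
    """Single linkage by scanning the original distance matrix.

    dist  : symmetric n x n distance matrix (list of lists)
    level : optional preset level; merging stops once d exceeds it
    Returns the list of merges (d, m, k) and the final clusters (0-based).
    """
    n = len(dist)
    parent = list(range(n))  # union-find forest: every object is its own cluster

    def find(i):
        # Follow parent links to the cluster representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Visiting the values in increasing order replaces "find and delete the
    # smallest value" of steps 4 and 5.
    pairs = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    merges = []
    for d, m, k in pairs:
        if level is not None and d > level:
            break
        root_m, root_k = find(m), find(k)
        if root_m != root_k:       # m and k are not yet in the same cluster
            parent[root_k] = root_m
            merges.append((d, m, k))

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return merges, sorted(clusters.values())


# Squared Euclidean distances of the points (0, 0), (1, 0) and (5, 5).
D2 = [[0, 1, 50], [1, 0, 41], [50, 41, 0]]
print(modified_single_linkage(D2))
print(modified_single_linkage(D2, level=10))
```

With a preset level of 10 the scan stops before the value 41 is reached, so the third point remains a cluster of its own.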


Example 13.4 Let us examine the agglomerative algorithm for the three points in
Example 13.2, $x_1 = (0, 0)$, $x_2 = (1, 0)$ and $x_3 = (5, 5)$, and the squared Euclidean
distance matrix with single linkage weighting. The algorithm starts with $N = 3$
clusters: $P = \{x_1\}$, $Q = \{x_2\}$ and $R = \{x_3\}$. The distance matrix $\mathcal{D}_2$ is given in
Example 13.2. The smallest distance in $\mathcal{D}_2$ is the one between the clusters $P$ and
$Q$. Therefore, applying step 4 in the above algorithm we combine these clusters to
form $P + Q = \{x_1, x_2\}$. The single linkage distance between the remaining two
clusters is from Table 13.2 and (13.9) equal to

$$\begin{aligned}
d(R, P + Q) &= \tfrac{1}{2} d(R, P) + \tfrac{1}{2} d(R, Q) - \tfrac{1}{2} |d(R, P) - d(R, Q)| \\
&= \tfrac{1}{2} d_{13} + \tfrac{1}{2} d_{23} - \tfrac{1}{2} |d_{13} - d_{23}| \\
&= \tfrac{50}{2} + \tfrac{41}{2} - \tfrac{1}{2} |50 - 41| \\
&= 41.
\end{aligned} \tag{13.10}$$

The reduced distance matrix is then $\begin{pmatrix} 0 & 41 \\ 41 & 0 \end{pmatrix}$. The next and last step is to unite the
clusters $R$ and $P + Q$ into a single cluster $\mathcal{X}$, the original data matrix.

When  there  are  more  data  points  than  in  the  example  above,  a  visualisation  of 
the  implication  of  clusters  is  desirable.  A  graphical  representation  of  the  sequence 
of  clustering  is  called  a  dendrogram.  It  displays  the  observations,  the  sequence  of 
clusters  and  the  distances  between  the  clusters.  The  vertical  axis  displays  the  indices 
of  the  points,  whereas  the  horizontal  axis  gives  the  distance  between  the  clusters. 
Large  distances  indicate  the  clustering  of  heterogeneous  groups.  Thus,  if  we  choose 
to  “cut  the  tree”  at  a  desired  level,  the  branches  describe  the  corresponding  clusters. 
Fig. 13.1 The 8-point example Q MVAclus8p (panel title: 8 points; horizontal axis: first coordinate)

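Example 13.5 Here we describe the single linkage algorithm for the eight data
points displayed in Fig. 13.1. The distance matrix ($L_2$-norms) is

$$\mathcal{D} = \begin{pmatrix}
0 & 10 & 53 & 73 & 50 & 98 & 41 & 65 \\
 & 0 & 25 & 41 & 20 & 80 & 37 & 65 \\
 & & 0 & 2 & 1 & 25 & 18 & 34 \\
 & & & 0 & 5 & 17 & 20 & 32 \\
 & & & & 0 & 36 & 25 & 45 \\
 & & & & & 0 & 13 & 9 \\
 & & & & & & 0 & 4 \\
 & & & & & & & 0
\end{pmatrix}$$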


and  the  dendrogram  is  shown  in  Fig.  13.2. 

If  we  decide  to  cut  the  tree  at  the  level  10,  three  clusters  are  defined:  {1,2}, 
{3,4,5}  and  {6, 7,  8}. 
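
The cut at level 10 can be retraced with a direct implementation of the agglomerative algorithm. The following Python sketch takes the upper triangle of the distance matrix of this example, repeatedly merges the two closest clusters with the single linkage update (the minimum of the two distances) and stops as soon as the smallest remaining distance exceeds the cut level. The function name and the dictionary-based bookkeeping are assumptions of this sketch, not the MVAclus8p quantlet.

```python
# Upper triangle (including the diagonal) of the 8-point distance matrix.
upper = [
    [0, 10, 53, 73, 50, 98, 41, 65],
    [0, 25, 41, 20, 80, 37, 65],
    [0, 2, 1, 25, 18, 34],
    [0, 5, 17, 20, 32],
    [0, 36, 25, 45],
    [0, 13, 9],
    [0, 4],
    [0],
]

# Fill the full symmetric matrix from the upper triangle.
n = len(upper)
D = [[0] * n for _ in range(n)]
for i, row in enumerate(upper):
    for offset, value in enumerate(row):
        D[i][i + offset] = D[i + offset][i] = value


def single_linkage_clusters(D, cut):
    """Agglomerate with single linkage until the closest distance exceeds cut."""
    size = len(D)
    clusters = {i: [i + 1] for i in range(size)}  # points are labelled 1..n
    dist = {(i, j): D[i][j] for i in range(size) for j in range(i + 1, size)}
    heights = []
    while len(clusters) > 1:
        (p, q), d = min(dist.items(), key=lambda item: item[1])
        if d > cut:
            break
        heights.append(d)
        clusters[p] = sorted(clusters[p] + clusters.pop(q))
        # Reduced distance matrix: keep pairs not involving p or q ...
        reduced = {pair: v for pair, v in dist.items()
                   if p not in pair and q not in pair}
        # ... and recompute the distances to the merged cluster as a minimum.
        for r in clusters:
            if r != p:
                d_rp = dist[(min(r, p), max(r, p))]
                d_rq = dist[(min(r, q), max(r, q))]
                reduced[(min(r, p), max(r, p))] = min(d_rp, d_rq)
        dist = reduced
    return sorted(clusters.values()), heights


clusters, heights = single_linkage_clusters(D, cut=10)
print(clusters)
print(heights)
```

The recorded merge heights are the levels at which branches of the dendrogram are joined; the first merge above the cut would occur at distance 17.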

The  single  linkage  algorithm  defines  the  distance  between  two  groups  as  the 
smallest  value  of  the  individual  distances.  Table  13.2  shows  that  in  this  case 

d(R,P  +  Q)  =  min {d(R,P),d(R,  Q)}.  (13.11) 

This  algorithm  is  also  called  the  Nearest  Neighbour  algorithm.  As  a  consequence 
of  its  construction,  single  linkage  tends  to  build  large  groups.  Groups  that  differ  but 
are  not  well  separated  may  thus  be  classified  into  one  group  as  long  as  they  have 




Fig. 13.2 The dendrogram for the 8-point example, single linkage algorithm Q MVAclus8p (panel title: Single Linkage Dendrogram - 8 points)

two approximate points. The complete linkage algorithm tries to correct this kind
of  grouping  by  considering  the  largest  (individual)  distances.  Indeed,  the  complete 
linkage  distance  can  be  written  as 

d(R, P + Q) = max{d(R, P), d(R, Q)}.  (13.12)

It  is  also  called  the  Farthest  Neighbour  algorithm.  This  algorithm  will  cluster 
groups  where  all  the  points  are  proximate,  since  it  compares  the  largest  distances. 
The  average  linkage  algorithm  (weighted  or  unweighted)  proposes  a  compromise 
between  the  two  preceding  algorithms,  in  that  it  computes  an  average  distance: 

$$d(R, P + Q) = \frac{n_P}{n_P + n_Q} d(R, P) + \frac{n_Q}{n_P + n_Q} d(R, Q). \tag{13.13}$$

The  centroid  algorithm  is  quite  similar  to  the  average  linkage  algorithm  and  uses 
the  natural  geometrical  distance  between  R  and  the  weighted  centre  of  gravity  of  P 
and  Q  (see  Fig.  13.3): 

$$d(R, P + Q) = \frac{n_P}{n_P + n_Q} d(R, P) + \frac{n_Q}{n_P + n_Q} d(R, Q) - \frac{n_P n_Q}{(n_P + n_Q)^2} d(P, Q). \tag{13.14}$$

The  Ward  clustering  algorithm  computes  the  distance  between  groups  according 
to  the  formula  in  Table  13.2.  The  main  difference  between  this  algorithm  and  the 
linkage  procedures  is  in  the  unification  procedure.  The  Ward  algorithm  does  not  put 
together  groups  with  smallest  distance.  Instead,  it  joins  groups  that  do  not  increase 
a  given  measure  of  heterogeneity  “too  much”.  The  aim  of  the  Ward  procedure  is 
Fig. 13.3 The centroid algorithm (points R, P and Q; weighted centre of gravity of P + Q)


to  unify  groups  such  that  the  variation  inside  these  groups  does  not  increase  too 
drastically:  the  resulting  groups  are  as  homogeneous  as  possible. 

The heterogeneity of group $R$ is measured by the inertia inside the group. This
inertia is defined as follows:

$$I_R = \frac{1}{n_R} \sum_{i=1}^{n_R} d^2(x_i, \bar{x}_R) \tag{13.15}$$

where $\bar{x}_R$ is the centre of gravity (mean) over the groups. $I_R$ clearly provides a
scalar measure of the dispersion of the group around its centre of gravity. If the
usual Euclidean distance is used, then $I_R$ represents the sum of the variances of the
$p$ components of $x_i$ inside group $R$.

When two objects or groups $P$ and $Q$ are joined, the new group $P + Q$ has a
larger inertia $I_{P+Q}$. It can be shown that the corresponding increase of inertia is
given by

$$\Delta(P, Q) = \frac{n_P n_Q}{n_P + n_Q} d^2(P, Q). \tag{13.16}$$

In this case, the Ward algorithm is defined as an algorithm that “joins the groups
that give the smallest increase in $\Delta(P, Q)$”. It is easy to prove that when $P$ and $Q$
are joined, the new criterion values are given by (13.9) along with the values of $\delta_i$
given in Table 13.2, when the centroid formula is used to modify $d^2(R, P + Q)$.
So, the Ward algorithm is related to the centroid algorithm, but with an “inertial”
distance $\Delta$ rather than the “geometric” distance $d^2$.
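As a numerical illustration of the Ward criterion, the following Python sketch compares $\Delta(P, Q) = \frac{n_P n_Q}{n_P + n_Q} d^2(P, Q)$ with the growth of the within-group sum of squared Euclidean distances when two groups are merged. Two assumptions are made here: inertia is taken in its unnormalised sum-of-squares form $n_R I_R$, and $d^2(P, Q)$ is the squared Euclidean distance between the group centroids. The helper names and data points are hypothetical.

```python
def centroid(points):
    """Centre of gravity of a list of points given as tuples."""
    n = len(points)
    return tuple(sum(x[k] for x in points) / n for k in range(len(points[0])))


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def sum_of_squares(points):
    """Unnormalised inertia: sum of squared distances to the centroid."""
    c = centroid(points)
    return sum(sq_dist(x, c) for x in points)


def ward_increase(p_points, q_points):
    """Delta(P, Q) with d^2 taken between the two centroids."""
    n_p, n_q = len(p_points), len(q_points)
    return n_p * n_q / (n_p + n_q) * sq_dist(centroid(p_points), centroid(q_points))


# Hypothetical toy groups.
P = [(0.0, 0.0), (2.0, 0.0)]
Q = [(5.0, 4.0), (7.0, 4.0), (6.0, 7.0)]

delta = ward_increase(P, Q)
growth = sum_of_squares(P + Q) - sum_of_squares(P) - sum_of_squares(Q)
print(delta, growth)
```

Both numbers coincide for this example, which is the content of the statement that Ward's method joins the pair of groups whose union adds the least heterogeneity.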

As pointed out in Sect. 13.2, all the algorithms above can be adjusted by the
choice of the metric $\mathcal{A}$ defining the geometric distance $d^2$. If the results of a
clustering algorithm are illustrated as graphical representations of individuals in
spaces of low dimension (using principal components (normalised or not) or using
a correspondence analysis for contingency tables), it is important to be coherent in
the choice of the metric used.

Fig. 13.4 PCA for 20 randomly chosen bank notes Q MVAclusbank

Fig. 13.5 The dendrogram for the 20 bank notes, Ward algorithm Q MVAclusbank


Example  13.6  As  an  example  we  randomly  select  20  observations  from  the  bank 
notes  data  and  apply  the  Ward  technique  using  Euclidean  distances.  Figure  13.4 
shows  the  first  two  PCs  of  these  data,  Fig.  13.5  displays  the  dendrogram. 

Example 13.7 Consider the French food expenditures. As in Chap. 11 we use
standardised data, which is equivalent to using
$\mathcal{A} = \operatorname{diag}(s_{X_1X_1}^{-1}, \ldots, s_{X_7X_7}^{-1})$ as
the weight matrix in the $L_2$-norm. The NPCA plot of the individuals was given
in Fig. 11.7. The Euclidean distance matrix is of course given by (13.6). The
dendrogram obtained by using the Ward algorithm is shown in Fig. 13.6.

Fig. 13.6 The dendrogram for the French food expenditures, Ward algorithm Q MVAclusfood

If  the  aim  was  to  have  only  two  groups,  as  can  be  seen  in  Fig.  13.6,  they  would 
be  {CA2,  CA3,  CA4,  CA5,  EM5}  and  {MA2,  MA3,  MA4,  MA5,  EM2,  EM3, 
EM4}.  Clustering  three  groups  is  somewhat  arbitrary  (the  levels  of  the  distances 
are  too  similar).  If  we  were  interested  in  four  groups,  we  would  obtain  {CA2, 
CA3,  CA4},  {EM2,  MA2,  EM3,  MA3},  {EM4,  MA4,  MA5}  and  {EM5,  CA5}. 
This  grouping  shows  a  balance  between  socio-professional  levels  and  size  of  the 
families  in  determining  the  clusters.  The  four  groups  are  clearly  well  represented  in 
the NPCA plot in Fig. 11.7.

Summary

↪ The class of clustering algorithms can be divided into two types:
hierarchical and partitioning algorithms. Hierarchical algorithms
start with the finest (coarsest) possible partition and put groups
together (split groups apart) step by step. Partitioning algorithms
start from a preliminary clustering and exchange group elements
until a certain score is reached.

↪ Hierarchical agglomerative techniques are frequently used in practice.
They start from the finest possible structure (each data point
forms a cluster), compute the distance matrix for the clusters
and join the clusters that have the smallest distance. This step is
repeated until all points are united in one cluster.

↪ The agglomerative procedure depends on the definition of the
distance between two clusters. Single linkage, complete linkage,
and Ward distance are frequently used distances.

↪ The process of the unification of clusters can be graphically
represented by a dendrogram.
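The agglomerative procedure summarised above can be sketched as a deliberately naive Python routine. It is only an illustration under simple assumptions: Euclidean distances, recomputation of all group distances at every step instead of the update formulas, and a `linkage` argument (a hypothetical name) that is `min` for single linkage and `max` for complete linkage. The returned merge history is the information a dendrogram displays.

```python
import math


def euclid(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def agglomerate(points, linkage=min):
    """Merge history of a naive agglomerative clustering.

    linkage=min gives single linkage, linkage=max gives complete linkage.
    Each history entry is (members of group 1, members of group 2, distance).
    """
    clusters = [[i] for i in range(len(points))]  # finest partition
    history = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(
                    euclid(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        history.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return history


# Hypothetical one-dimensional toy data.
points = [(0.0,), (1.0,), (5.0,), (6.5,)]
single = agglomerate(points, linkage=min)
complete = agglomerate(points, linkage=max)
for step in single:
    print(step)
```

For these four points both linkages produce the same sequence of merges, but the height of the last merge differs: single linkage reports the smallest gap between the two groups, complete linkage the largest.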


13.4  Boston  Housing 

The multivariate techniques presented above are now applied to the Boston Housing data. We
focus our attention on 14 transformed and standardised variables, see e.g. Fig. 13.7
that provides descriptive statistics via boxplots for two clusters, as discussed in the
sequel. A dendrogram for 13 variables (excluding the dummy variable X4, the Charles
River indicator) using the Ward method is displayed in Fig. 13.8. One observes two
dominant clusters. A further refinement of, say, four clusters could be considered at
a lower level of distance.

Fig. 13.7 Boxplots of the 14 standardised variables of the Boston housing data Q MVAclusbh

Fig. 13.8 Dendrogram of the Boston housing data using the Ward algorithm Q MVAclusbh

Table 13.3 Means and standard errors of the 13 standardised variables for
Cluster 1 (251 observations) and Cluster 2 (255 observations) Q MVAclusbh

Variable   Mean C1   SE C1    Mean C2   SE C2
1          -0.7105   0.0332    0.6994   0.0535
2           0.4848   0.0786   -0.4772   0.0047
3          -0.7665   0.0510    0.7545   0.0279
5          -0.7672   0.0365    0.7552   0.0447
6           0.4162   0.0571   -0.4097   0.0576
7          -0.7730   0.0429    0.7609   0.0378
8           0.7140   0.0472   -0.7028   0.0417
9          -0.5429   0.0358    0.5344   0.0656
10         -0.6932   0.0301    0.6823   0.0569
11         -0.5464   0.0469    0.5378   0.0582
12          0.3547   0.0080   -0.3491   0.0824
13         -0.6899   0.0401    0.6791   0.0509
14          0.5996   0.0431   -0.5902   0.0570

To interpret the two clusters, we present the mean values and their respective
standard errors of the 13 X variables by groups in Table 13.3. Comparison of
the mean values for both groups shows that all the differences in the means are
individually significant. Moreover, cluster one corresponds to housing districts with
better living quality and higher house prices, whereas cluster two corresponds to less
favored districts in Boston. This can be confirmed, for instance, by a lower crime
rate, a higher proportion of residential land, a lower proportion of African Americans,
etc. for cluster one. Cluster two is identified by a higher proportion of older houses,
a higher pupil/teacher ratio and a higher percentage of the lower status population.

Fig. 13.9 Scatterplot matrix for variables X1 to X7 of the Boston housing data Q MVAclusbh

This interpretation is underlined by visual inspection of all the variables via
scatterplot matrices, see e.g. Figs. 13.9 and 13.10. For example, the lower right
boxplot of Fig. 13.7 and the correspondingly coloured clusters in the last row
of Fig. 13.10 confirm the role of each variable in determining the clusters. This
interpretation perfectly coincides with the previous PC analysis (Fig. 11.11). The
quality of life factor is clearly visible in Fig. 13.11, where cluster membership is
distinguished by the shape and colour of the points graphed according to the first
two principal components. Clearly, the first PC completely separates the two clusters
and corresponds, as we have discussed in Chap. 11, to a quality of life and house
indicator.

Fig. 13.10 Scatterplot matrix for variables X8 to X14 of the Boston housing data Q MVAclusbh

Fig. 13.11 Scatterplot of the first two PCs displaying the two clusters Q MVAclusbh


13.5  Exercises 


Exercise  13.1  Prove  formula  (13.16). 

Exercise 13.2 Prove that $I_R = \operatorname{tr}(\mathcal{S}_R)$, where $\mathcal{S}_R$ denotes the empirical covariance
matrix of the observations contained in $R$.

Exercise 13.3 Prove that

$$\Delta(R, P + Q) = \frac{n_R + n_P}{n_R + n_P + n_Q}\,\Delta(R, P) + \frac{n_R + n_Q}{n_R + n_P + n_Q}\,\Delta(R, Q) - \frac{n_R}{n_R + n_P + n_Q}\,\Delta(P, Q),$$

when the centroid formula is used to define $d^2(R, P + Q)$.

Exercise 13.4 Repeat the 8-point example (Example 13.5) using the complete
linkage and the Ward algorithm. Explain the difference to single linkage.

Exercise  13.5  Explain  the  differences  between  various  proximity  measures  by 
means  of  an  example. 

Exercise  13.6  Repeat  the  bank  notes  example  (Example  13.6)  with  another  random 
sample  of  20  notes. 

Exercise 13.7 Repeat the bank notes example (Example 13.6) with another clustering
algorithm.

Exercise 13.8 Repeat the bank notes example (Example 13.6) or the 8-point
example (Example 13.5) with the $L_1$-norm.

Exercise 13.9 Analyse the US companies example (Table 22.5) using the Ward
algorithm and the $L_2$-norm.

Exercise 13.10 Analyse the US crime data set (Table 22.10) with the Ward algorithm
and the $L_2$-norm on standardised variables (use only the crime variables).

Exercise 13.11 Repeat Exercise 13.10 with the US health data set (use only the
number of deaths variables).

Exercise 13.12 Redo Exercise 13.10 with the $\chi^2$-metric. Compare the results.

Exercise 13.13 Redo Exercise 13.11 with the $\chi^2$-metric and compare the results.

Chapter  14 

Discriminant  Analysis 


Discriminant  analysis  is  used  in  situations  where  the  clusters  are  known  a  priori.  The 
aim  of  discriminant  analysis  is  to  classify  an  observation,  or  several  observations, 
into  these  known  groups.  For  instance,  in  credit  scoring,  a  bank  knows  from 
past  experience  that  there  are  good  customers  (who  repay  their  loan  without  any 
problems)  and  bad  customers  (who  showed  difficulties  in  repaying  their  loan).  When 
a  new  customer  asks  for  a  loan,  the  bank  has  to  decide  whether  or  not  to  give  the 
loan. The past records of the bank provide two data sets: multivariate observations
$x_i$ on the two categories of customers (including for example age, salary, marital
status, the amount of the loan, etc.). The new customer is a new observation $x$ with
the  same  variables.  The  discrimination  rule  has  to  classify  the  customer  into  one  of 
the  two  existing  groups  and  the  discriminant  analysis  should  evaluate  the  risk  of  a 
possible  “bad  decision”. 

Many  other  examples  are  described  below,  and  in  most  applications,  the  groups 
correspond  to  natural  classifications  or  to  groups  known  from  history  (like  in  the 
credit  scoring  example).  These  groups  could  have  been  formed  by  a  cluster  analysis 
performed  on  past  data. 

Section  14.1  presents  the  allocation  rules  when  the  populations  are  known,  i.e. 
when we know the distribution of each population. As described in Sect. 14.2,
in practice the population characteristics have to be estimated from history. The
methods  are  illustrated  in  several  examples. 


14.1  Allocation  Rules  for  Known  Distributions 

Discriminant analysis is a set of methods and tools used to distinguish between
groups of populations $\Pi_j$ and to determine how to allocate new observations into
groups. In one of our running examples we are interested in discriminating between
counterfeit and true bank notes on the basis of measurements of these bank notes,
see Sect. 22.2. In this case we have two groups (counterfeit and genuine bank
notes) and we would like to establish an algorithm (rule) that can allocate a new
observation (a new bank note) into one of the groups.

Another  example  is  the  detection  of  “fast”  and  “slow”  consumers  of  a  newly 
introduced  product.  Using  a  consumer’s  characteristics  like  education,  income, 
family  size,  amount  of  previous  brand  switching,  we  want  to  classify  each  consumer 
into  the  two  groups  just  identified. 

In  poetry  and  literary  studies  the  frequencies  of  spoken  or  written  words  and 
lengths  of  sentences  indicate  profiles  of  different  artists  and  writers.  It  can  be  of 
interest  to  attribute  unknown  literary  or  artistic  works  to  certain  writers  with  a 
specific profile. Anthropological measures on ancient skulls help in discriminating
between  male  and  female  bodies.  Good  and  poor  credit  risk  ratings  constitute  a 
discrimination  problem  that  might  be  tackled  using  observations  on  income,  age, 
number  of  credit  cards,  family  size,  etc. 

In general we have populations $\Pi_j$, $j = 1, 2, \ldots, J$ and we have to allocate
an observation $x$ to one of these groups. A discriminant rule is a separation of the
sample space (in general $\mathbb{R}^p$) into sets $R_j$ such that if $x \in R_j$, it is identified as a
member of population $\Pi_j$.

The  main  task  of  discriminant  analysis  is  to  find  “good”  regions  Rj  such  that  the 
error  of  misclassification  is  small.  In  the  following  we  describe  such  rules  when  the 
population  distributions  are  known. 


Maximum  Likelihood  Discriminant  Rule 

Denote the densities of each population $\Pi_j$ by $f_j(x)$. The maximum likelihood
discriminant rule (ML rule) is given by allocating $x$ to the population $\Pi_j$ that maximises
the likelihood, i.e. $L_j(x) = f_j(x) = \max_i f_i(x)$.

If several $f_i$ give the same maximum then any of them may be selected.
Mathematically, the sets $R_j$ given by the ML discriminant rule are defined as

$$R_j = \{x : L_j(x) > L_i(x) \text{ for } i = 1, \ldots, J,\ i \ne j\}. \tag{14.1}$$

By classifying the observation into a certain group we may encounter a misclassification
error. For $J = 2$ groups the probability of putting $x$ into group 2 although
it is from population 1 can be calculated as

$$p_{21} = P(X \in R_2 \mid \Pi_1) = \int_{R_2} f_1(x)\,dx. \tag{14.2}$$

Similarly the conditional probability of classifying an object as belonging to the first
population $\Pi_1$ although it actually comes from $\Pi_2$ is

$$p_{12} = P(X \in R_1 \mid \Pi_2) = \int_{R_1} f_2(x)\,dx. \tag{14.3}$$

The misclassified observations create a cost $C(i|j)$ when a $\Pi_j$ observation is
assigned to $R_i$. In the credit risk example, this might be the cost of a “sour” credit.
The cost structure can be pinned down in a cost matrix:

                         Classified population
                         Π1          Π2
True population   Π1     0           C(2|1)
                  Π2     C(1|2)      0

Let $\pi_j$ be the prior probability of population $\Pi_j$, where “prior” means the a
priori probability that an individual selected at random belongs to $\Pi_j$ (i.e. before
looking at the value $x$). Prior probabilities should be considered if it is clear ahead
of time that an observation is more likely to stem from a certain population $\Pi_j$. An
example is the classification of musical tunes. If it is known that during a certain
period of time a majority of tunes were written by a certain composer, then there is
a higher probability that a certain tune was composed by this composer. Therefore,
this composer should receive a higher prior probability when tunes are assigned to a
specific group.

The expected cost of misclassification (ECM) is given by

$$\mathrm{ECM} = C(2|1)\, p_{21}\, \pi_1 + C(1|2)\, p_{12}\, \pi_2. \tag{14.4}$$


We  will  be  interested  in  classification  rules  that  keep  the  ECM  small  or  minimise 
it  over  a  class  of  rules.  The  discriminant  rule  minimising  the  ECM  (14.4)  for  two 
populations  is  given  below. 

Theorem 14.1 For two given populations, the rule minimising the ECM is given by

$$R_1 = \left\{ x : \frac{f_1(x)}{f_2(x)} \ge \left(\frac{C(1|2)}{C(2|1)}\right) \left(\frac{\pi_2}{\pi_1}\right) \right\},$$

$$R_2 = \left\{ x : \frac{f_1(x)}{f_2(x)} < \left(\frac{C(1|2)}{C(2|1)}\right) \left(\frac{\pi_2}{\pi_1}\right) \right\}.$$

The ML discriminant rule is thus a special case of the ECM rule for equal
misclassification costs and equal prior probabilities. For simplicity the unity cost
case, $C(1|2) = C(2|1) = 1$, and equal prior probabilities, $\pi_2 = \pi_1$, are assumed in
the following.
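The allocation rule of Theorem 14.1 can be written as a one-line comparison. The Python sketch below is only an illustration: the function names, the two univariate normal populations and the cost values are assumptions chosen for the example. The density ratio is compared in cross-multiplied form, which is equivalent to the inequality in the theorem and avoids dividing by a density that may be zero.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def ecm_allocate(f1_x, f2_x, cost_12=1.0, cost_21=1.0, prior_1=0.5, prior_2=0.5):
    """Return 1 if x lies in R1 of Theorem 14.1, otherwise 2.

    cost_12 stands for C(1|2), cost_21 for C(2|1). The condition
    f1/f2 >= (C(1|2)/C(2|1)) (pi2/pi1) is checked as
    f1 * C(2|1) * pi1 >= f2 * C(1|2) * pi2.
    """
    return 1 if f1_x * cost_21 * prior_1 >= f2_x * cost_12 * prior_2 else 2


# Hypothetical populations N(0, 1) and N(2, 1), new observation x = 1.2.
x = 1.2
f1, f2 = normal_pdf(x, 0.0, 1.0), normal_pdf(x, 2.0, 1.0)

ml_choice = ecm_allocate(f1, f2)                  # unit costs, equal priors
costly_choice = ecm_allocate(f1, f2, cost_21=5.0)  # misclassifying a member of population 1 is expensive
print(ml_choice, costly_choice)
```

With unit costs and equal priors the observation is allocated to the population with the larger density; raising C(2|1) enlarges the region R1 and changes the allocation of this point.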

Theorem  14.1  will  be  proven  by  an  example  from  credit  scoring. 

Example 14.1 Suppose that $\Pi_1$ represents the population of bad clients who create
the cost $C(2|1)$ if they are classified as good clients. Analogously, define $C(1|2)$ as
the cost of losing a good client classified as a bad one. Let $\gamma$ denote the gain of the
bank for the correct classification of a good client. The total gain of the bank is then

$$G(R_2) = -C(2|1)\,\pi_1 \int I(x \in R_2)\, f_1(x)\,dx - C(1|2)\,\pi_2 \int \{1 - I(x \in R_2)\}\, f_2(x)\,dx + \gamma\,\pi_2 \int I(x \in R_2)\, f_2(x)\,dx$$

$$\phantom{G(R_2)} = -C(1|2)\,\pi_2 + \int I(x \in R_2)\,\{-C(2|1)\,\pi_1 f_1(x) + (C(1|2) + \gamma)\,\pi_2 f_2(x)\}\,dx.$$

Since the first term in this equation is constant, the maximum is obviously obtained
for

$$R_2 = \{x : -C(2|1)\,\pi_1 f_1(x) + \{C(1|2) + \gamma\}\,\pi_2 f_2(x) \ge 0\}.$$

This is equivalent to

$$R_2 = \left\{ x : \frac{f_2(x)}{f_1(x)} \ge \frac{C(2|1)\,\pi_1}{\{C(1|2) + \gamma\}\,\pi_2} \right\},$$

which corresponds to the set $R_2$ in Theorem 14.1 for a gain of $\gamma = 0$.
Example 14.2 Suppose $x \in \{0, 1\}$ and

$$\Pi_1 : P(X = 0) = P(X = 1) = \tfrac{1}{2},$$

$$\Pi_2 : P(X = 0) = \tfrac{1}{4} = 1 - P(X = 1).$$

The sample space is the set $\{0, 1\}$. The ML discriminant rule is to allocate $x = 0$ to
$\Pi_1$ and $x = 1$ to $\Pi_2$, defining the sets $R_1 = \{0\}$, $R_2 = \{1\}$ and $R_1 \cup R_2 = \{0, 1\}$.

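Example 14.3 Consider two normal populations

$$\Pi_1 : N(\mu_1, \sigma_1^2),$$
$$\Pi_2 : N(\mu_2, \sigma_2^2).$$

Then

$$L_i(x) = (2\pi\sigma_i^2)^{-1/2} \exp\left\{ -\frac{1}{2} \left( \frac{x - \mu_i}{\sigma_i} \right)^2 \right\}.$$

Hence $x$ is allocated to $\Pi_1$ ($x \in R_1$) if $L_1(x) > L_2(x)$. Note that $L_1(x) > L_2(x)$
is equivalent to

$$\frac{\sigma_2}{\sigma_1} \exp\left[ -\frac{1}{2} \left\{ \left( \frac{x - \mu_1}{\sigma_1} \right)^2 - \left( \frac{x - \mu_2}{\sigma_2} \right)^2 \right\} \right] > 1$$

or

$$x^2 \left( \frac{1}{\sigma_1^2} - \frac{1}{\sigma_2^2} \right) - 2x \left( \frac{\mu_1}{\sigma_1^2} - \frac{\mu_2}{\sigma_2^2} \right) + \left( \frac{\mu_1^2}{\sigma_1^2} - \frac{\mu_2^2}{\sigma_2^2} \right) < 2 \log \frac{\sigma_2}{\sigma_1}. \tag{14.5}$$

Suppose that $\mu_1 = 0$, $\sigma_1 = 1$ and $\mu_2 = 1$, $\sigma_2 = \frac{1}{2}$. Formula (14.5) leads to

$$R_1 = \left\{ x : x < \tfrac{1}{3}\left(4 - \sqrt{4 + 6 \log 2}\right) \ \text{or} \ x > \tfrac{1}{3}\left(4 + \sqrt{4 + 6 \log 2}\right) \right\}.$$

This situation is shown in Fig. 14.1.

Fig. 14.1 Maximum likelihood rule for normal distributions Q MVAdisnorm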
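The two boundary points of Example 14.3 can be checked numerically. The short Python sketch below (function and variable names are illustrative) evaluates both likelihoods just inside and just outside the interval bounded by $\tfrac{1}{3}(4 \pm \sqrt{4 + 6\log 2})$, which is roughly 0.38 and 2.29.

```python
import math


def likelihood(x, mu, sigma):
    """Normal likelihood L(x) for N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi * sigma ** 2)


mu1, sigma1 = 0.0, 1.0
mu2, sigma2 = 1.0, 0.5

# Boundary points obtained from formula (14.5) for these parameter values.
root = math.sqrt(4.0 + 6.0 * math.log(2.0))
lower, upper = (4.0 - root) / 3.0, (4.0 + root) / 3.0


def in_r1(x):
    """True if the ML rule allocates x to population 1."""
    return likelihood(x, mu1, sigma1) > likelihood(x, mu2, sigma2)


for x in (0.0, lower, 1.0, upper, 3.0):
    print(round(x, 4), in_r1(x))
```

Population 2 has the smaller variance, so it wins only on the interval around its mean; everything in both tails is allocated to population 1.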


The situation simplifies in the case of equal variances $\sigma_1 = \sigma_2$. The discriminant
rule (14.5) is then (for $\mu_1 < \mu_2$)

$$x \to \Pi_1, \ \text{if } x \in R_1 = \{x : x \le \tfrac{1}{2}(\mu_1 + \mu_2)\},$$
$$x \to \Pi_2, \ \text{if } x \in R_2 = \{x : x > \tfrac{1}{2}(\mu_1 + \mu_2)\}. \tag{14.6}$$

Theorem  14.2  shows  that  the  ML  discriminant  rule  for  multinormal  observations 
is  intimately  connected  with  the  Mahalanobis  distance.  The  discriminant  rule  is 
based  on  linear  combinations  and  belongs  to  the  family  of  linear  discriminant 
analysis  (LDA)  methods. 

Theorem 14.2 Suppose $\Pi_i = N_p(\mu_i, \Sigma)$.

(a) The ML rule allocates $x$ to $\Pi_j$, where $j \in \{1, \ldots, J\}$ is the value minimising
the square Mahalanobis distance between $x$ and $\mu_i$:

$$\delta^2(x, \mu_i) = (x - \mu_i)^\top \Sigma^{-1} (x - \mu_i), \quad i = 1, \ldots, J.$$

(b) In the case of $J = 2$,

$$x \in R_1 \iff \alpha^\top (x - \mu) \ge 0,$$

where $\alpha = \Sigma^{-1}(\mu_1 - \mu_2)$ and $\mu = \tfrac{1}{2}(\mu_1 + \mu_2)$.

Proof Part (a) of the theorem follows directly from comparison of the likelihoods.
For $J = 2$, part (a) says that $x$ is allocated to $\Pi_1$ if

$$(x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \le (x - \mu_2)^\top \Sigma^{-1} (x - \mu_2).$$

Rearranging terms leads to

$$-2\mu_1^\top \Sigma^{-1} x + 2\mu_2^\top \Sigma^{-1} x + \mu_1^\top \Sigma^{-1} \mu_1 - \mu_2^\top \Sigma^{-1} \mu_2 \le 0,$$

which is equivalent to

$$2(\mu_2 - \mu_1)^\top \Sigma^{-1} x + (\mu_1 - \mu_2)^\top \Sigma^{-1} (\mu_1 + \mu_2) \le 0,$$

$$(\mu_1 - \mu_2)^\top \Sigma^{-1} \{x - \tfrac{1}{2}(\mu_1 + \mu_2)\} \ge 0,$$

$$\alpha^\top (x - \mu) \ge 0.$$

□
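The equivalence shown in the proof can be illustrated numerically. The following Python sketch uses a hypothetical bivariate example (the covariance matrix, the means and the helper names are assumptions) and compares the allocation by minimal squared Mahalanobis distance with the sign of the linear score $\alpha^\top(x - \mu)$.

```python
def inverse_2x2(m):
    """Inverse of a 2 x 2 matrix given as a tuple of rows."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))


def mat_vec(m, v):
    """Matrix-vector product."""
    return tuple(sum(mij * vj for mij, vj in zip(row, v)) for row in m)


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def mahalanobis_sq(x, mu, sigma_inv):
    """Squared Mahalanobis distance between x and mu."""
    diff = tuple(xi - mi for xi, mi in zip(x, mu))
    return dot(diff, mat_vec(sigma_inv, diff))


def linear_score(x, mu1, mu2, sigma_inv):
    """alpha' (x - mu) with alpha = Sigma^{-1} (mu1 - mu2), mu = (mu1 + mu2) / 2."""
    alpha = mat_vec(sigma_inv, tuple(a - b for a, b in zip(mu1, mu2)))
    mid = tuple(0.5 * (a + b) for a, b in zip(mu1, mu2))
    return dot(alpha, tuple(xi - mi for xi, mi in zip(x, mid)))


# Hypothetical common covariance matrix and population means.
sigma_inv = inverse_2x2(((2.0, 0.5), (0.5, 1.0)))
mu1, mu2 = (0.0, 0.0), (2.0, 1.0)

for x in [(0.5, 0.2), (1.8, 1.1), (1.0, -1.0), (3.0, 0.0)]:
    by_distance = mahalanobis_sq(x, mu1, sigma_inv) <= mahalanobis_sq(x, mu2, sigma_inv)
    by_score = linear_score(x, mu1, mu2, sigma_inv) >= 0
    print(x, 1 if by_distance else 2, 1 if by_score else 2)
```

Both columns agree for every point, because the difference of the two squared distances equals twice the linear score.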
Bayes Discriminant Rule

We have seen an example where prior knowledge on the probability of classification
into $\Pi_j$ was assumed. Denote the prior probabilities by $\pi_j$ and note that
$\sum_{j=1}^{J} \pi_j = 1$. The Bayes rule of discrimination allocates $x$ to the $\Pi_j$ that gives the
largest value of $\pi_i f_i(x)$, i.e. $\pi_j f_j(x) = \max_i \pi_i f_i(x)$. Hence, the discriminant rule is
defined by $R_j = \{x : \pi_j f_j(x) \ge \pi_i f_i(x) \text{ for } i = 1, \ldots, J\}$. Obviously the Bayes
rule is identical to the ML discriminant rule for $\pi_j = 1/J$.

A further modification is to allocate $x$ to $\Pi_j$ with a certain probability $\phi_j(x)$,
such that $\sum_{j=1}^{J} \phi_j(x) = 1$ for all $x$. This is called a randomised discriminant rule.
A randomised discriminant rule is a generalisation of deterministic discriminant
rules since

$$\phi_j(x) = \begin{cases} 1 & \text{if } \pi_j f_j(x) = \max_i \pi_i f_i(x), \\ 0 & \text{otherwise} \end{cases}$$

reflects the deterministic rules.

Which discriminant rules are good? We need a measure of comparison. Denote

$$p_{ij} = \int \phi_i(x)\, f_j(x)\,dx \tag{14.7}$$

as the probability of allocating $x$ to $\Pi_i$ if it in fact belongs to $\Pi_j$. A discriminant
rule with probabilities $p_{ij}$ is as good as any other discriminant rule with probabilities
$p'_{ij}$ if

$$p_{ii} \ge p'_{ii} \quad \text{for all } i = 1, \ldots, J. \tag{14.8}$$

We call the first rule better if the strict inequality in (14.8) holds for at least one $i$. A
discriminant rule is called admissible if there is no better discriminant rule.

Theorem 14.3 All Bayes discriminant rules (including the ML rule) are admissible.


Probability of Misclassification for the ML Rule (J = 2)

Suppose that $\Pi_i = N_p(\mu_i, \Sigma)$. In the case of two groups, it is not difficult to
derive the probabilities of misclassification for the ML discriminant rule. Consider
for instance $p_{12} = P(x \in R_1 \mid \Pi_2)$. By part (b) in Theorem 14.2 we have

$$p_{12} = P\{\alpha^\top (x - \mu) > 0 \mid \Pi_2\}.$$

If $X$ stems from $\Pi_2$, then $\alpha^\top (X - \mu) \sim N\left(-\tfrac{1}{2}\delta^2, \delta^2\right)$, where $\delta^2 = (\mu_1 - \mu_2)^\top \Sigma^{-1} (\mu_1 - \mu_2)$ is
the squared Mahalanobis distance between the two populations. We therefore obtain

$$p_{12} = \Phi\left(-\tfrac{1}{2}\delta\right).$$

Similarly, the probability of being classified into population 2 although $x$ stems from
$\Pi_1$ is equal to $p_{21} = \Phi\left(-\tfrac{1}{2}\delta\right)$.
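The misclassification probability $\Phi(-\tfrac{1}{2}\delta)$ is easy to evaluate with the error function from Python's standard library. The function names below are illustrative; the values of $\delta$ are arbitrary examples.

```python
import math


def std_normal_cdf(z):
    """Standard normal distribution function Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def misclassification_probability(delta):
    """p12 = p21 = Phi(-delta / 2) for the ML rule with two normal populations."""
    return std_normal_cdf(-0.5 * delta)


for delta in (0.0, 1.0, 2.0, 4.0):
    print(delta, round(misclassification_probability(delta), 4))
```

For $\delta = 0$ the two populations coincide and the rule is no better than a coin flip; the error probability decreases monotonically as the populations move apart.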


Classification  with  Different  Covariance  Matrices 


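The minimum ECM depends on the ratio of the densities or equivalently on the
difference $\log\{f_1(x)\} - \log\{f_2(x)\}$. When the covariance matrices of the two density
functions differ, the allocation rule becomes more complicated:

$$R_1 = \left\{ x : -\frac{1}{2} x^\top (\Sigma_1^{-1} - \Sigma_2^{-1}) x + (\mu_1^\top \Sigma_1^{-1} - \mu_2^\top \Sigma_2^{-1}) x - k \ge \log\left[ \left( \frac{C(1|2)}{C(2|1)} \right) \left( \frac{\pi_2}{\pi_1} \right) \right] \right\},$$

$$R_2 = \left\{ x : -\frac{1}{2} x^\top (\Sigma_1^{-1} - \Sigma_2^{-1}) x + (\mu_1^\top \Sigma_1^{-1} - \mu_2^\top \Sigma_2^{-1}) x - k < \log\left[ \left( \frac{C(1|2)}{C(2|1)} \right) \left( \frac{\pi_2}{\pi_1} \right) \right] \right\},$$

where $k = \frac{1}{2} \log\left( \frac{|\Sigma_1|}{|\Sigma_2|} \right) + \frac{1}{2} (\mu_1^\top \Sigma_1^{-1} \mu_1 - \mu_2^\top \Sigma_2^{-1} \mu_2)$. The classification regions
are defined by quadratic functions. Therefore they belong to the family of quadratic
discriminant analysis (QDA) methods. This quadratic classification rule coincides
with the rules used when $\Sigma_1 = \Sigma_2$, since the term $\frac{1}{2} x^\top (\Sigma_1^{-1} - \Sigma_2^{-1}) x$ disappears.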
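In one dimension the quadratic rule is easy to verify by hand or by machine. The Python sketch below is a univariate special case under stated assumptions ($\Sigma_i$ reduces to a variance, the determinant to that variance, and all names are hypothetical). It checks that the quadratic score equals $\log f_1(x) - \log f_2(x)$ for the parameters of Example 14.3.

```python
import math


def log_normal_pdf(x, mu, var):
    """Logarithm of the N(mu, var) density at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mu) ** 2 / var


def quadratic_score(x, mu1, var1, mu2, var2):
    """Left-hand side of the QDA rule in one dimension."""
    k = 0.5 * math.log(var1 / var2) + 0.5 * (mu1 ** 2 / var1 - mu2 ** 2 / var2)
    return (-0.5 * x * x * (1.0 / var1 - 1.0 / var2)
            + (mu1 / var1 - mu2 / var2) * x
            - k)


# Parameters of Example 14.3: N(0, 1) and N(1, 1/4).
mu1, var1, mu2, var2 = 0.0, 1.0, 1.0, 0.25

for x in (-1.0, 0.3, 1.0, 2.5):
    score = quadratic_score(x, mu1, var1, mu2, var2)
    log_ratio = log_normal_pdf(x, mu1, var1) - log_normal_pdf(x, mu2, var2)
    print(x, round(score, 6), round(log_ratio, 6))
```

An observation is then allocated to the first population whenever the score is at least $\log[(C(1|2)/C(2|1))(\pi_2/\pi_1)]$, which is zero for unit costs and equal priors.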


Summary

↪ Discriminant analysis is a set of methods used to distinguish among
groups in data and to allocate new observations into the existing
groups.

↪ Given that data are from populations $\Pi_j$ with densities $f_j$, $j = 1, \ldots, J$,
the maximum likelihood discriminant rule (ML rule)
allocates an observation $x$ to that population $\Pi_j$ which has the
maximum likelihood $L_j(x) = f_j(x) = \max_i f_i(x)$.

↪ Given prior probabilities $\pi_j$ for populations $\Pi_j$, Bayes discriminant
rule allocates an observation $x$ to the population $\Pi_j$ that
maximises $\pi_i f_i(x)$ with respect to $i$. All Bayes discriminant rules
(incl. the ML rule) are admissible.

↪ For the ML rule and $J = 2$ normal populations, the probabilities
of misclassification are given by $p_{12} = p_{21} = \Phi\left(-\tfrac{1}{2}\delta\right)$ where $\delta$ is
the Mahalanobis distance between the two populations.

↪ Classification of two normal populations with different covariance
matrices (ML rule) leads to regions defined by a quadratic function.

↪ Desirable discriminant rules have a low ECM.


14.2 Discrimination Rules in Practice

The ML rule is used if the distribution of the data is known up to parameters.
Suppose for example that the data come from multivariate normal distributions
$N_p(\mu_j, \Sigma)$. If we have $J$ groups with $n_j$ observations in each group, we use $\bar{x}_j$
to estimate $\mu_j$, and $\mathcal{S}_j$ to estimate $\Sigma$. The common covariance may be estimated by

$$\mathcal{S}_u = \sum_{j=1}^{J} n_j \left( \frac{\mathcal{S}_j}{n - J} \right), \tag{14.9}$$

with $n = \sum_{j=1}^{J} n_j$. Thus the empirical version of the ML rule of Theorem 14.2 is
to allocate a new observation $x$ to $\Pi_j$ such that $j$ minimises

$$(x - \bar{x}_i)^\top \mathcal{S}_u^{-1} (x - \bar{x}_i) \quad \text{for } i \in \{1, \ldots, J\}.$$


Example 14.4 Let us apply this rule to the Swiss bank notes. The 20 randomly
chosen bank notes which we had clustered into two groups in Example 13.6 are
used. First the covariance $\Sigma$ is estimated by the average of the covariances of $\Pi_1$
(cluster 1) and $\Pi_2$ (cluster 2). The hyperplane $\hat{\alpha}^\top (x - \bar{x}) = 0$ which separates the
two populations is given by

$$\hat{\alpha} = \mathcal{S}_u^{-1} (\bar{x}_1 - \bar{x}_2) = (-12.18,\ 20.54,\ -19.22,\ -15.55,\ -13.06,\ 21.43)^\top,$$
$$\bar{x} = \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2) = (214.79,\ 130.05,\ 129.92,\ 9.23,\ 10.48,\ 140.46)^\top.$$

Now let us apply the discriminant rule to the entire bank notes data set. Counting
the number of misclassifications by

$$\sum_{i=1}^{100} I\{\hat{\alpha}^\top (x_i - \bar{x}) < 0\}, \qquad \sum_{i=101}^{200} I\{\hat{\alpha}^\top (x_i - \bar{x}) > 0\},$$

we obtain 1 misclassified observation for the counterfeit bank notes and 0 misclassifications
for the genuine bank notes.

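When $J = 3$ groups, the allocation regions can be calculated using

$$h_{12}(x) = (\bar{x}_1 - \bar{x}_2)^\top \mathcal{S}_u^{-1} \left\{ x - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2) \right\},$$
$$h_{13}(x) = (\bar{x}_1 - \bar{x}_3)^\top \mathcal{S}_u^{-1} \left\{ x - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_3) \right\},$$
$$h_{23}(x) = (\bar{x}_2 - \bar{x}_3)^\top \mathcal{S}_u^{-1} \left\{ x - \tfrac{1}{2}(\bar{x}_2 + \bar{x}_3) \right\}.$$

The rule is to allocate $x$ to

$\Pi_1$ if $h_{12}(x) \ge 0$ and $h_{13}(x) \ge 0$,
$\Pi_2$ if $h_{12}(x) < 0$ and $h_{23}(x) \ge 0$,
$\Pi_3$ if $h_{13}(x) < 0$ and $h_{23}(x) < 0$.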

Estimation  of  the  Probabilities  of  Misclassifications 

Misclassification probabilities are given by (14.7) and can be estimated by replacing
the unknown parameters by their corresponding estimators.

For the ML rule for two normal populations we obtain

$$\hat{p}_{12} = \hat{p}_{21} = \Phi\left(-\tfrac{1}{2}\hat{\delta}\right),$$

where $\hat{\delta}^2 = (\bar{x}_1 - \bar{x}_2)^\top \mathcal{S}_u^{-1} (\bar{x}_1 - \bar{x}_2)$ is the estimator for $\delta^2$.

The probabilities of misclassification may also be estimated by the re-substitution
method. We reclassify each original observation $x_i$, $i = 1, \ldots, n$ into $\Pi_1, \ldots, \Pi_J$
according to the chosen rule. Then denoting the number of individuals coming from
$\Pi_j$ which have been classified into $\Pi_i$ by $n_{ij}$, we have $\hat{p}_{ij} = n_{ij} / n_j$, an estimator of
$p_{ij}$. Clearly, this method leads to too optimistic estimators of $p_{ij}$, but it provides a
rough measure of the quality of the discriminant rule. The matrix $(\hat{p}_{ij})$ is called the
confusion matrix in Johnson and Wichern (1998).

Example 14.5 In the above classification problem for the Swiss bank notes
(Sect. 22.2), we have the following confusion matrix: Q MVAaper

                      true membership
                      genuine (Π1)    counterfeit (Π2)
predicted    Π1       100             1
             Π2       0               99

The apparent error rate (APER) is defined as the fraction of observations that
are misclassified. The APER, expressed as a percentage, is

$$\mathrm{APER} = \left( \frac{1}{200} \right) 100\,\% = 0.5\,\%.$$

For the calculation of the APER we use the observations twice: the first time to
construct the classification rule and the second time to evaluate this rule. An APER
of 0.5 % might therefore be too optimistic. An approach that corrects for this bias
is based on the holdout procedure of Lachenbruch and Mickey (1968). For two
populations this procedure is as follows:

1. Start with the first population $\Pi_1$. Omit one observation and develop the
classification rule based on the remaining $n_1 - 1$, $n_2$ observations.

2. Classify the “holdout” observation using the discrimination rule in Step 1.

3. Repeat steps 1 and 2 until all of the $\Pi_1$ observations are classified. Count the
number $n'_{21}$ of misclassified observations.

4. Repeat steps 1 through 3 for population $\Pi_2$. Count the number $n'_{12}$ of misclassified
observations.

Estimates of the misclassification probabilities are given by

$$\hat{p}'_{12} = \frac{n'_{12}}{n_2} \quad \text{and} \quad \hat{p}'_{21} = \frac{n'_{21}}{n_1}.$$

A more realistic estimator of the actual error rate (AER) is given by

$$\frac{n'_{12} + n'_{21}}{n_2 + n_1}. \tag{14.10}$$

Statisticians favor the AER (for its unbiasedness) over the APER. In large samples,
however, the computational costs might counterbalance the statistical advantage.
This is not a real problem since the two misclassification measures are asymptotically
equivalent.
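The four steps of the holdout procedure can be sketched in Python for the simplest possible setting. The sketch assumes univariate data and the empirical version of rule (14.6), i.e. allocation to the group with the nearer sample mean; the data values and function names are hypothetical and chosen so that the two error estimates differ.

```python
def mean(values):
    return sum(values) / len(values)


def allocate(x, sample1, sample2):
    """Empirical midpoint rule: the group with the nearer sample mean wins."""
    return 1 if abs(x - mean(sample1)) <= abs(x - mean(sample2)) else 2


def apparent_error_rate(sample1, sample2):
    """Re-substitution: the rule is built and evaluated on the same data."""
    wrong = sum(allocate(x, sample1, sample2) != 1 for x in sample1)
    wrong += sum(allocate(x, sample1, sample2) != 2 for x in sample2)
    return wrong / (len(sample1) + len(sample2))


def holdout_error_rate(sample1, sample2):
    """Leave-one-out estimate (14.10): each point is classified by a rule
    fitted without it."""
    wrong_1 = sum(
        allocate(x, sample1[:i] + sample1[i + 1:], sample2) != 1
        for i, x in enumerate(sample1)
    )
    wrong_2 = sum(
        allocate(x, sample1, sample2[:i] + sample2[i + 1:]) != 2
        for i, x in enumerate(sample2)
    )
    return (wrong_1 + wrong_2) / (len(sample1) + len(sample2))


# Hypothetical toy samples; the value 3.9 lies close to the boundary.
group1 = [1.0, 1.5, 2.0, 3.9]
group2 = [5.0, 5.5, 6.0, 6.5]

aper = apparent_error_rate(group1, group2)
aer_estimate = holdout_error_rate(group1, group2)
print(aper, aer_estimate)
```

In this toy example the borderline observation pulls its own group mean towards itself and is therefore classified correctly under re-substitution, but not once it is held out, which illustrates the optimism of the APER.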


Fisher’s  Linear  Discrimination  Function 

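Another approach stems from R. A. Fisher. His idea was to base the discriminant rule on a projection $a^\top x$ such that a good separation was achieved. This LDA projection method is called Fisher's linear discrimination function. If

$$\mathcal{Y} = \mathcal{X} a$$

denotes a linear combination of observations, then the total sum of squares of $y$, $\sum_{i=1}^{n} (y_i - \bar y)^2$, is equal to

$$\mathcal{Y}^\top \mathcal{H} \mathcal{Y} = a^\top \mathcal{X}^\top \mathcal{H} \mathcal{X} a = a^\top \mathcal{T} a \tag{14.11}$$

with the centering matrix $\mathcal{H} = \mathcal{I} - n^{-1} 1_n 1_n^\top$ and $\mathcal{T} = \mathcal{X}^\top \mathcal{H} \mathcal{X}$.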

Suppose we have samples $\mathcal{X}_j$, $j = 1, \dots, J$, from $J$ populations. Fisher's suggestion was to find the linear combination $a^\top x$ which maximises the ratio of the between-group-sum of squares to the within-group-sum of squares.

The within-group-sum of squares is given by

$$\sum_{j=1}^{J} \mathcal{Y}_j^\top \mathcal{H}_j \mathcal{Y}_j = \sum_{j=1}^{J} a^\top \mathcal{X}_j^\top \mathcal{H}_j \mathcal{X}_j a = a^\top \mathcal{W} a, \tag{14.12}$$

where $\mathcal{Y}_j$ denotes the $j$-th sub-matrix of $\mathcal{Y}$ corresponding to observations of group $j$ and $\mathcal{H}_j$ denotes the $(n_j \times n_j)$ centering matrix. The within-group-sum of squares measures the sum of variations within each group.

The between-group-sum of squares is

$$\sum_{j=1}^{J} n_j (\bar y_j - \bar y)^2 = \sum_{j=1}^{J} n_j \{a^\top (\bar x_j - \bar x)\}^2 = a^\top \mathcal{B} a, \tag{14.13}$$

where $\bar y_j$ and $\bar x_j$ denote the means of $\mathcal{Y}_j$ and $\mathcal{X}_j$ and $\bar y$ and $\bar x$ denote the sample means of $\mathcal{Y}$ and $\mathcal{X}$. The between-group-sum of squares measures the variation of the means across groups.
The total sum of squares (14.11) is the sum of the within-group-sum of squares and the between-group-sum of squares, i.e.

$$a^\top \mathcal{T} a = a^\top \mathcal{W} a + a^\top \mathcal{B} a.$$

Fisher's idea was to select a projection vector $a$ that maximises the ratio

$$\frac{a^\top \mathcal{B} a}{a^\top \mathcal{W} a}. \tag{14.14}$$

The solution is found by applying Theorem 2.5.

Theorem 14.4 The vector $a$ that maximises (14.14) is the eigenvector of $\mathcal{W}^{-1}\mathcal{B}$ that corresponds to the largest eigenvalue.

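Now a discrimination rule is easy to obtain: classify $x$ into group $j$ where $a^\top \bar x_j$ is closest to $a^\top x$, i.e.

$$x \to \Pi_j \quad \text{where} \quad j = \arg\min_i |a^\top (x - \bar x_i)|.$$

When $J = 2$ groups, the discriminant rule is easy to compute. Suppose that group 1 has $n_1$ elements and group 2 has $n_2$ elements. In this case

$$\mathcal{B} = \left(\frac{n_1 n_2}{n}\right) d d^\top,$$

where $d = (\bar x_1 - \bar x_2)$. Since $\mathcal{B}$ has rank one, $\mathcal{W}^{-1}\mathcal{B}$ has only one non-zero eigenvalue, which equals

$$\operatorname{tr}(\mathcal{W}^{-1}\mathcal{B}) = \left(\frac{n_1 n_2}{n}\right) d^\top \mathcal{W}^{-1} d,$$

and the corresponding eigenvector is $a = \mathcal{W}^{-1} d$. The corresponding discriminant rule is

$$\begin{aligned}
x \to \Pi_1 \quad &\text{if} \quad a^\top \{x - \tfrac{1}{2}(\bar x_1 + \bar x_2)\} > 0,\\
x \to \Pi_2 \quad &\text{if} \quad a^\top \{x - \tfrac{1}{2}(\bar x_1 + \bar x_2)\} \le 0.
\end{aligned} \tag{14.15}$$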
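The two-group case can be checked numerically. The following Python sketch, on simulated data (an assumption, as is the function name `fisher_two_groups`), builds $\mathcal{W}$ and $\mathcal{B}$, confirms the decomposition of the total sum of squares, and verifies that $a = \mathcal{W}^{-1} d$ is an eigenvector of $\mathcal{W}^{-1}\mathcal{B}$ whose eigenvalue equals the trace.

```python
import numpy as np

def fisher_two_groups(x1, x2):
    """Return (a, W, B, midpoint) for Fisher's rule with J = 2 groups."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    d = m1 - m2
    # Within-group sum of squares and cross products.
    w = (x1 - m1).T @ (x1 - m1) + (x2 - m2).T @ (x2 - m2)
    # Between-group matrix B = (n1 n2 / n) d d'.
    b = (n1 * n2 / (n1 + n2)) * np.outer(d, d)
    a = np.linalg.solve(w, d)
    return a, w, b, 0.5 * (m1 + m2)

# Simulated three-dimensional samples (illustrative only).
rng = np.random.default_rng(1)
x1 = rng.normal(loc=[0.0, 0.0, 0.0], size=(40, 3))
x2 = rng.normal(loc=[1.0, 0.5, -0.5], size=(60, 3))

a, w, b, mid = fisher_two_groups(x1, x2)

# Total sum of squares matrix T = X' H X of the stacked sample.
x = np.vstack([x1, x2])
xc = x - x.mean(axis=0)
t = xc.T @ xc

w_inv_b = np.linalg.solve(w, b)
largest = max(np.linalg.eigvals(w_inv_b).real)

print(np.allclose(t, w + b))                  # T = W + B
print(np.isclose(largest, np.trace(w_inv_b))) # single non-zero eigenvalue
print(np.allclose(w_inv_b @ a, largest * a))  # a is the eigenvector

# Rule (14.15): allocate to group 1 when the score is positive.
scores1 = (x1 - mid) @ a
scores2 = (x2 - mid) @ a
misclassified = int((scores1 <= 0).sum() + (scores2 > 0).sum())
print(misclassified)
```

Solving the linear system with `np.linalg.solve` avoids forming $\mathcal{W}^{-1}$ explicitly, which is a design choice of this sketch rather than something prescribed by the text.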


The Fisher LDA is closely related to projection pursuit (Chap. 20) since the statistical technique is based on a one-dimensional index $a^\top x$.

Example 14.6 Consider the bank notes data again. Let us use the subscript "g" for the genuine and "f" for the counterfeit bank notes, e.g. $\mathcal{X}_g$ denotes the first hundred observations of $\mathcal{X}$ and $\mathcal{X}_f$ the second hundred. In the context of the bank data set the "between-group-sum of squares" is defined as

$$100 \left\{ (\bar y_g - \bar y)^2 + (\bar y_f - \bar y)^2 \right\} = a^\top \mathcal{B} a \tag{14.16}$$

for some matrix $\mathcal{B}$. Here, $\bar y_g$ and $\bar y_f$ denote the means for the genuine and counterfeit bank notes and $\bar y = \tfrac{1}{2}(\bar y_g + \bar y_f)$. The "within-group-sum of squares" is

$$\sum_{i=1}^{100} \{(y_g)_i - \bar y_g\}^2 + \sum_{i=1}^{100} \{(y_f)_i - \bar y_f\}^2 = a^\top \mathcal{W} a, \tag{14.17}$$

with $(y_g)_i = a^\top x_i$ and $(y_f)_i = a^\top x_{i+100}$ for $i = 1, \dots, 100$.

The resulting discriminant rule consists of allocating an observation $x_0$ to the genuine sample space if

$$a^\top (x_0 - \bar x) > 0,$$

with $a = \mathcal{W}^{-1}(\bar x_g - \bar x_f)$ (see Exercise 14.8) and of allocating $x_0$ to the counterfeit sample space when the opposite is true. In our case

$$a = (0.000, 0.029, -0.029, -0.039, -0.041, 0.054)^\top.$$

One genuine and no counterfeit bank notes are misclassified. Figure 14.2 shows the estimated densities for $y_g = a^\top \mathcal{X}_g$ and $y_f = a^\top \mathcal{X}_f$. They are separated better than those of the diagonals in Fig. 1.9.

Note that the allocation rule (14.15) is exactly the same as the ML rule for $J = 2$ groups and for normal distributions with the same covariance. For $J = 3$ groups this rule will be different, except for the special case of collinear sample means.


Fig. 14.2 Densities of projections of genuine and counterfeit bank notes by Fisher's discrimination rule Q MVAdisfbank


Summary

- A discriminant rule is a separation of the sample space into sets $R_j$. An observation $x$ is classified as coming from population $\Pi_j$ if it lies in $R_j$.

- The ECM for two populations is given by $\mathrm{ECM} = C(2|1)\, p_{21} \pi_1 + C(1|2)\, p_{12} \pi_2$.

- The ML rule is applied if the distributions in the populations are known up to parameters, e.g. for normal distributions $N_p(\mu_j, \Sigma)$.

- The ML rule allocates $x$ to the population that exhibits the smallest Mahalanobis distance

  $$\delta^2(x; \mu_i) = (x - \mu_i)^\top \Sigma^{-1} (x - \mu_i).$$

- The probability of misclassification is given by

  $$p_{12} = p_{21} = \Phi\left(-\tfrac{1}{2}\delta\right),$$

  where $\delta$ is the Mahalanobis distance between $\mu_1$ and $\mu_2$.

- Classification for different covariance structures in the two populations leads to quadratic discrimination rules.

- A different approach is Fisher's linear discrimination rule which finds a linear combination $a^\top x$ that maximises the ratio of the "between-group-sum of squares" and the "within-group-sum of squares". This rule turns out to be identical to the ML rule when $J = 2$ for normal populations.


14.3  Boston  Housing 

One interesting application of discriminant analysis with respect to the Boston housing data is the classification of the districts according to the house values. The rationale behind this is that certain observables must determine the value of a district, as in Sect. 3.7 where the house value was regressed on the other variables. Two groups are defined according to the median value of houses $X_{14}$: in group $\Pi_1$ the value of $X_{14}$ is greater than or equal to the median of $X_{14}$ and in group $\Pi_2$ the value of $X_{14}$ is less than the median of $X_{14}$.
Table 14.1 APER for price of Boston houses Q MVAdiscbh

                      True
                   Π1      Π2
Predicted   Π1    216      40
            Π2     34     216


Table 14.2 AER for price of Boston houses Q MVAaerbh

                      True
                   Π1      Π2
Predicted   Π1    211      42
            Π2     39     214


Table 14.3 APER for clusters of Boston houses Q MVAdiscbh

                      True
                   Π1      Π2
Predicted   Π1    244      13
            Π2      7     242


Table 14.4 AER for clusters of Boston houses Q MVAaerbh

                      True
                   Π1      Π2
Predicted   Π1    244      14
            Π2      7     241

The linear discriminant rule, defined on the remaining 12 variables (excluding $X_4$ and $X_{14}$), is applied. After reclassifying the 506 observations, we obtain an APER of 0.146. The details are given in Table 14.1. The more appropriate error rate, given by the AER, is 0.160 (see Table 14.2).

Let us now turn to a group definition suggested by the Cluster Analysis in Sect. 13.4. Group $\Pi_1$ was defined by higher quality of life and house. We define the linear discriminant rule using the 13 variables from $\mathcal{X}$ excluding $X_4$. Then we reclassify the 506 observations and we obtain an APER of 0.0395. Details are summarised in Table 14.3. The AER turns out to be 0.0415 (see Table 14.4).
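The four error rates quoted above follow directly from the cell counts of Tables 14.1 to 14.4. A short Python sketch of the arithmetic is given below; the function name `error_rate` and the dictionary labels are illustrative choices, and only the counts themselves come from the tables.

```python
def error_rate(table):
    """Share of off-diagonal counts in a square classification table."""
    total = sum(sum(row) for row in table)
    correct = sum(table[i][i] for i in range(len(table)))
    return (total - correct) / total

# Cell counts as reported in Tables 14.1 to 14.4 (506 observations each).
tables = {
    "APER, price (Table 14.1)": [[216, 40], [34, 216]],
    "AER, price (Table 14.2)": [[211, 42], [39, 214]],
    "APER, clusters (Table 14.3)": [[244, 13], [7, 242]],
    "AER, clusters (Table 14.4)": [[244, 14], [7, 241]],
}

rates = {name: error_rate(t) for name, t in tables.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.4f}")
```

The rate depends only on the sum of the off-diagonal cells, so it is unaffected by which axis of the table holds the true groups.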

Figure 14.3 displays the values of the linear discriminant scores (see Theorem 14.2) for all of the 506 observations, coloured by groups. One can clearly see that the APER is derived from the seven observations from group $\Pi_1$ with a negative score and the 13 observations from group $\Pi_2$ with a positive score.



Fig.  14.3  Discrimination  scores  for  the  two  clusters  created  from  the  Boston  housing  data  Q 
MVAdiscbh 


14.4  Exercises 


Exercise 14.1 Prove Theorem 14.2 (a) and 14.2 (b).

Exercise 14.2 Apply the rule from Theorem 14.2 (b) for $p = 1$ and compare the result with that of Example 14.3.

Exercise 14.3 Calculate the ML discrimination rule based on observations of a one-dimensional variable with an exponential distribution.

Exercise 14.4 Calculate the ML discrimination rule based on observations of a two-dimensional random variable, where the first component has an exponential distribution and the other has an alternative distribution. What is the difference between the discrimination rule obtained in this exercise and the Bayes discrimination rule?

Exercise 14.5 Apply the Bayes rule to the car data (Table 22.3) in order to discriminate between Japanese, European and US cars, i.e. $J = 3$. Consider only the "miles per gallon" variable and take the relative frequencies as prior probabilities.

Exercise 14.6 Compute Fisher's linear discrimination function for the 20 bank notes from Example 13.6. Apply it to the entire bank data set. How many observations are misclassified?

Exercise 14.7 Use Fisher's linear discrimination function on the WAIS data set (Table 22.12) and evaluate the probabilities of misclassification by re-substitution.

Exercise 14.8 Show that in Example 14.6

(a) $\mathcal{W} = 100\,(\mathcal{S}_g + \mathcal{S}_f)$, where $\mathcal{S}_g$ and $\mathcal{S}_f$ denote the empirical covariances (3.6) and (3.5) w.r.t. the genuine and counterfeit bank notes,

(b) $\mathcal{B} = 100 \left\{ (\bar x_g - \bar x)(\bar x_g - \bar x)^\top + (\bar x_f - \bar x)(\bar x_f - \bar x)^\top \right\}$, where $\bar x = \tfrac{1}{2}(\bar x_g + \bar x_f)$,

(c) $a = \mathcal{W}^{-1}(\bar x_g - \bar x_f)$.
Exercise  14.9  Recalculate  Example  14.3  with  the  prior  probability  n\  =  |  and 
C(  2|1)  =  2C(1|2). 

Exercise 14.10 Explain the effect of changing $\pi_1$ or $C(1|2)$ on the relative location of the region $R_j$, $j = 1, 2$.

Exercise 14.11 Prove that Fisher's linear discrimination function is identical to the ML rule when the covariance matrices are identical ($J = 2$).

Exercise 14.12 Suppose that $x \in \{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10\}$ and

$\Pi_1: X \sim \mathrm{Bi}(10, 0.2)$ with the prior probability $\pi_1 = 0.5$;

$\Pi_2: X \sim \mathrm{Bi}(10, 0.3)$ with the prior probability $\pi_2 = 0.3$;

$\Pi_3: X \sim \mathrm{Bi}(10, 0.5)$ with the prior probability $\pi_3 = 0.2$.

Determine the sets $R_1$, $R_2$ and $R_3$. (Use the Bayes discriminant rule.)


Chapter  15 

Correspondence  Analysis 


Correspondence  analysis  provides  tools  for  analysing  the  associations  between  rows 
and  columns  of  contingency  tables.  A  contingency  table  is  a  two-entry  frequency 
table  where  the  joint  frequencies  of  two  qualitative  variables  are  reported.  For 
instance a $(2 \times 2)$ table could be formed by observing from a sample of $n$ individuals two qualitative variables: the individual's sex and whether the individual smokes. The table reports the observed joint frequencies. In general $(n \times p)$ tables may be considered.

The main idea of correspondence analysis is to develop simple indices that will show the relations between the row and the column categories. These indices will
tell  us  simultaneously  which  column  categories  have  more  weight  in  a  row  category 
and  vice  versa.  Correspondence  analysis  is  also  related  to  the  issue  of  reducing  the 
dimension  of  the  table,  similar  to  principal  component  analysis  in  Chap.  11,  and  to 
the  issue  of  decomposing  the  table  into  its  factors  as  discussed  in  Chap.  10.  The 
idea  is  to  extract  the  indices  in  decreasing  order  of  importance  so  that  the  main 
information  of  the  table  can  be  summarised  in  spaces  with  smaller  dimensions.  For 
instance,  if  only  two  factors  (indices)  are  used,  the  results  can  be  shown  in  two- 
dimensional  graphs,  showing  the  relationship  between  the  rows  and  the  columns  of 
the  table. 

Section  15.1  defines  the  basic  notation  and  motivates  the  approach  and  Sect.  15.2 
gives  the  basic  theory.  The  indices  will  be  used  to  describe  the  statistic  measuring 
the  associations  in  the  table.  Several  examples  in  Sect.  15.3  show  how  to  provide 
and  interpret,  in  practice,  the  two-dimensional  graphs  displaying  the  relationship 
between  the  rows  and  the  columns  of  a  contingency  table. 
15.1 Motivation

The aim of correspondence analysis is to develop simple indices that show relations between the rows and columns of a contingency table. Contingency tables are very useful to describe the association between two variables in very general situations. The two variables can be qualitative (nominal), in which case they are also referred to as categorical variables. Each row and each column in the table represents one category of the corresponding variable. The entry $x_{ij}$ in the table $\mathcal{X}$ (with dimension $(n \times p)$) is the number of observations in a sample which simultaneously fall in the $i$th row category and the $j$th column category, for $i = 1, \dots, n$ and $j = 1, \dots, p$. Sometimes a "category" of a nominal variable is also called a "modality" of the variable.

The  variables  of  interest  can  also  be  discrete  quantitative  variables,  such  as  the 
number  of  family  members  or  the  number  of  accidents  an  insurance  company  had  to 
cover  during  1  year,  etc.  Here,  each  possible  value  that  the  variable  can  have  defines 
a  row  or  a  column  category.  Continuous  variables  may  be  taken  into  account  by 
defining  the  categories  in  terms  of  intervals  or  classes  of  values  which  the  variable 
can  take  on.  Thus  contingency  tables  can  be  used  in  many  situations,  implying  that 
correspondence  analysis  is  a  very  useful  tool  in  many  applications. 

The  graphical  relationships  between  the  rows  and  the  columns  of  the  table  X 
that  result  from  correspondence  analysis  are  based  on  the  idea  of  representing  all 
the  row  and  column  categories  and  interpreting  the  relative  positions  of  the  points 
in  terms  of  the  weights  corresponding  to  the  column  and  the  row.  This  is  achieved 
by  deriving  a  system  of  simple  indices  providing  the  coordinates  of  each  row  and 
each  column.  These  row  and  column  coordinates  are  simultaneously  represented  in 
the  same  graph.  It  is  then  clear  to  see  which  column  categories  are  more  important 
in  the  row  categories  of  the  table  (and  the  other  way  around). 

As was already alluded to, the construction of the indices is based on an idea similar to that of PCA. Using PCA the total variance was partitioned into independent contributions stemming from the principal components. Correspondence analysis, on the other hand, decomposes a measure of association, typically the total $\chi^2$ value used in testing independence, rather than decomposing the total variance.

Example 15.1 The French "baccalauréat" frequencies have been classified into regions and different baccalauréat categories, see Chap. 22, Table 22.8. Altogether $n = 202{,}100$ baccalauréats were observed. The joint frequency of the region Ile-de-France and the modality Philosophy, for example, is 9,724. That is, 9,724 baccalauréats were in Ile-de-France and the category Philosophy.

The question is whether certain regions prefer certain baccalauréat types. If we consider, for instance, the region Lorraine, we have the following percentages:

A      B      C      D      E      F      G      H
20.5   7.6    15.3   19.6   3.4    14.5   18.9   0.2

The total percentages of the different modalities of the variable baccalauréat are as follows:

A      B      C      D      E      F      G      H
22.6   10.7   16.2   22.8   2.6    9.7    15.2   0.2

One might argue that the region Lorraine seems to prefer the modalities E, F, G and dislike the specialisations A, B, C, D relative to the overall frequency of baccalauréat type.
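The comparison of the Lorraine profile with the overall profile is simple arithmetic and can be sketched as follows in Python. The percentages are those quoted in the example; the variable names and the ratio used as a crude indicator of over- or underrepresentation are illustrative assumptions, not the index developed later in the chapter.

```python
categories = ["A", "B", "C", "D", "E", "F", "G", "H"]
lorraine = [20.5, 7.6, 15.3, 19.6, 3.4, 14.5, 18.9, 0.2]  # percentages for Lorraine
overall = [22.6, 10.7, 16.2, 22.8, 2.6, 9.7, 15.2, 0.2]   # percentages for all regions

# Ratio above 1: modality overrepresented in Lorraine; below 1: underrepresented.
ratio = {c: l / o for c, l, o in zip(categories, lorraine, overall)}

over = [c for c, l, o in zip(categories, lorraine, overall) if l > o]
under = [c for c, l, o in zip(categories, lorraine, overall) if l < o]

for c in categories:
    print(f"{c}: {ratio[c]:.2f}")
print("overrepresented:", over)
print("underrepresented:", under)
```

Modality H has the same share in Lorraine as overall and therefore appears in neither list.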

In  correspondence  analysis  we  try  to  develop  an  index  for  the  regions  so  that 
this  over-  or  underrepresentation  can  be  measured  in  just  one  single  number. 
Simultaneously  we  try  to  weight  the  regions  so  that  we  can  see  in  which  region 
certain  baccalaureat  types  are  preferred. 

Example  15.2  Consider  n  types  of  companies  and  p  locations  of  these  companies. 
Is  there  a  certain  type  of  company  that  prefers  a  certain  location?  Or  is  there  a 
location  index  that  corresponds  to  a  certain  type  of  company? 

Assume that $n = 3$, $p = 3$, and that the frequencies are as follows:

$$\mathcal{X} = \begin{pmatrix} 4 & 0 & 2 \\ 0 & 1 & 1 \\ 1 & 1 & 4 \end{pmatrix}$$

where the rows correspond to the company types Finance, Energy and HiTech and the columns to the locations Frankfurt, Berlin and Munich.

The frequencies imply that four type three companies (HiTech) are in location 3 (Munich), and so on. Suppose there is a (company) weight vector $r = (r_1, \dots, r_n)^\top$ such that a location index $s_j$ could be defined as

$$s_j = c \sum_{i=1}^{n} r_i \frac{x_{ij}}{x_{\bullet j}}, \tag{15.1}$$

where $x_{\bullet j} = \sum_{i=1}^{n} x_{ij}$ is the number of companies in location $j$ and $c$ is a constant. $s_1$, for example, would give the average weighted frequency (by $r$) of companies in location 1 (Frankfurt).

Given a location weight vector $s^* = (s_1^*, \dots, s_p^*)^\top$, we can define a company index in the same way as

$$r_i^* = c^* \sum_{j=1}^{p} s_j^* \frac{x_{ij}}{x_{i\bullet}}, \tag{15.2}$$

where $c^*$ is a constant and $x_{i\bullet} = \sum_{j=1}^{p} x_{ij}$ is the sum of the $i$th row of $\mathcal{X}$, i.e. the number of type $i$ companies. Thus $r_2^*$, for example, would give the average weighted frequency (by $s^*$) of energy companies.

If (15.1) and (15.2) can be solved simultaneously for a "row weight" vector $r = (r_1, \dots, r_n)^\top$ and a "column weight" vector $s = (s_1, \dots, s_p)^\top$, we may represent each row category by $r_i$, $i = 1, \dots, n$, and each column category by $s_j$, $j = 1, \dots, p$, in a one-dimensional graph. If in this graph $r_i$ and $s_j$ are in close proximity (far from the origin), this would indicate that the $i$th row category has an important conditional frequency $x_{ij}/x_{\bullet j}$ in (15.1) and that the $j$th column category has an important conditional frequency $x_{ij}/x_{i\bullet}$ in (15.2). This would indicate a positive association between the $i$th row and the $j$th column. A similar line of argument could be used if $r_i$ was very far away from $s_j$ (and far from the origin). This would indicate a small conditional frequency contribution, or a negative association between the $i$th row and the $j$th column.


Summary

- The aim of correspondence analysis is to develop simple indices that show relations among qualitative variables in a contingency table.

- The joint representation of the indices reveals relations among the variables.

15.2  Chi-Square  Decomposition 

An alternative way of measuring the association between the row and column categories is a decomposition of the value of the $\chi^2$-test statistic. The well-known $\chi^2$-test for independence in a two-dimensional contingency table consists of two steps. First the expected value of each cell of the table is estimated under the hypothesis of independence. Second, the corresponding observed values are compared to the expected values using the statistic

$$t = \sum_{i=1}^{n} \sum_{j=1}^{p} (x_{ij} - E_{ij})^2 / E_{ij}, \tag{15.3}$$

where $x_{ij}$ is the observed frequency in cell $(i, j)$ and $E_{ij}$ is the corresponding estimated expected value under the assumption of independence, i.e.

$$E_{ij} = \frac{x_{i\bullet}\, x_{\bullet j}}{x_{\bullet\bullet}}. \tag{15.4}$$
Here $x_{\bullet\bullet} = \sum_{i=1}^{n} x_{i\bullet}$. Under the hypothesis of independence, $t$ has a $\chi^2_{(n-1)(p-1)}$ distribution. In the industrial location example introduced above, the value $t = 6.26$ on $(3-1)(3-1) = 4$ degrees of freedom lies below the 5 % critical value of 9.49, so independence is not rejected at that level. The table is nevertheless a convenient small example for investigating how the observed frequencies depart from independence.
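The statistic (15.3) for the industrial location example can be reproduced with a few lines of NumPy. This is a minimal sketch; the variable names are illustrative, and the frequencies are those of the $(3 \times 3)$ table of Example 15.2.

```python
import numpy as np

# Rows: Finance, Energy, HiTech; columns: Frankfurt, Berlin, Munich.
x = np.array([[4, 0, 2],
              [0, 1, 1],
              [1, 1, 4]], dtype=float)

row_sums = x.sum(axis=1)   # x_{i.}
col_sums = x.sum(axis=0)   # x_{.j}
total = x.sum()            # x_{..}

# Expected frequencies under independence, Eq. (15.4).
expected = np.outer(row_sums, col_sums) / total

# Chi-square statistic, Eq. (15.3).
t = ((x - expected) ** 2 / expected).sum()
df = (x.shape[0] - 1) * (x.shape[1] - 1)

print(round(t, 4), df)
```

The computed value is about 6.27, which agrees with the value 6.26 quoted in the text up to rounding.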

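The method of $\chi^2$ decomposition consists of finding the SVD of the matrix $\mathcal{C}$ $(n \times p)$ with elements

$$c_{ij} = (x_{ij} - E_{ij}) / E_{ij}^{1/2}. \tag{15.5}$$

The elements $c_{ij}$ may be viewed as measuring the (weighted) departure between the observed $x_{ij}$ and the theoretical values $E_{ij}$ under independence. This leads to the factorial tools of Chap. 10 which describe the rows and the columns of $\mathcal{C}$.

For simplification define the matrices $\mathcal{A}$ $(n \times n)$ and $\mathcal{B}$ $(p \times p)$ as

$$\mathcal{A} = \operatorname{diag}(x_{i\bullet}) \quad \text{and} \quad \mathcal{B} = \operatorname{diag}(x_{\bullet j}). \tag{15.6}$$

These matrices provide the marginal row frequencies $a$ $(n \times 1)$ and the marginal column frequencies $b$ $(p \times 1)$:

$$a = \mathcal{A} 1_n \quad \text{and} \quad b = \mathcal{B} 1_p. \tag{15.7}$$

It is easy to verify that

$$\mathcal{C} \sqrt{b} = 0 \quad \text{and} \quad \mathcal{C}^\top \sqrt{a} = 0, \tag{15.8}$$

where the square root of the vector is taken element by element and $R = \operatorname{rank}(\mathcal{C}) \le \min\{(n - 1), (p - 1)\}$. From (10.14) of Chap. 10, the SVD of $\mathcal{C}$ yields

$$\mathcal{C} = \Gamma \Lambda \Delta^\top, \tag{15.9}$$

where $\Gamma$ contains the eigenvectors of $\mathcal{C}\mathcal{C}^\top$, $\Delta$ the eigenvectors of $\mathcal{C}^\top\mathcal{C}$ and $\Lambda = \operatorname{diag}(\lambda_1^{1/2}, \dots, \lambda_R^{1/2})$ with $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_R$ (the eigenvalues of $\mathcal{C}\mathcal{C}^\top$). Equation (15.9) implies that

$$c_{ij} = \sum_{k=1}^{R} \lambda_k^{1/2} \gamma_{ik} \delta_{jk}. \tag{15.10}$$

Note that (15.3) can be rewritten as

$$\operatorname{tr}(\mathcal{C}\mathcal{C}^\top) = \sum_{k=1}^{R} \lambda_k = \sum_{i=1}^{n} \sum_{j=1}^{p} c_{ij}^2 = t. \tag{15.11}$$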
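Relations (15.8), (15.9) and (15.11) can be checked numerically on the industrial location table of Example 15.2. The sketch below uses `numpy.linalg.svd`; the variable names are illustrative assumptions.

```python
import numpy as np

x = np.array([[4, 0, 2],
              [0, 1, 1],
              [1, 1, 4]], dtype=float)

a = x.sum(axis=1)            # marginal row frequencies
b = x.sum(axis=0)            # marginal column frequencies
total = x.sum()

expected = np.outer(a, b) / total
c = (x - expected) / np.sqrt(expected)     # Eq. (15.5)

# Thin SVD: c = gamma @ diag(sing) @ delta_t, singular values in decreasing order.
gamma, sing, delta_t = np.linalg.svd(c, full_matrices=False)
lam = sing ** 2                            # eigenvalues of C C'

t = ((x - expected) ** 2 / expected).sum() # chi-square statistic (15.3)

print(np.allclose(c @ np.sqrt(b), 0.0))    # first relation in (15.8)
print(np.allclose(c.T @ np.sqrt(a), 0.0))  # second relation in (15.8)
print(np.isclose(lam.sum(), t))            # Eq. (15.11)
print(lam)
```

For this table the third eigenvalue is numerically zero, in line with $R \le \min\{(n-1), (p-1)\} = 2$.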
This relation shows that the SVD of $\mathcal{C}$ decomposes the total $\chi^2$ value rather than, as in Chap. 10, the total variance.

The duality relations between the row and the column space (10.11) are now for $k = 1, \dots, R$ given by

$$\begin{aligned}
\delta_k &= \frac{1}{\sqrt{\lambda_k}}\, \mathcal{C}^\top \gamma_k,\\
\gamma_k &= \frac{1}{\sqrt{\lambda_k}}\, \mathcal{C} \delta_k.
\end{aligned} \tag{15.12}$$

The projections of the rows and the columns of $\mathcal{C}$ are given by

$$\mathcal{C} \delta_k = \sqrt{\lambda_k}\, \gamma_k, \tag{15.13}$$

$$\mathcal{C}^\top \gamma_k = \sqrt{\lambda_k}\, \delta_k. \tag{15.14}$$

Note that the eigenvectors satisfy

$$\delta_k^\top \sqrt{b} = 0, \qquad \gamma_k^\top \sqrt{a} = 0.$$

From (15.10) we see that the eigenvectors $\delta_k$ and $\gamma_k$ are the objects of interest when analysing the correspondence between the rows and the columns. Suppose that the first eigenvalue in (15.10) is dominant so that

$$c_{ij} \approx \lambda_1^{1/2} \gamma_{i1} \delta_{j1}. \tag{15.15}$$

In this case when the coordinates $\gamma_{i1}$ and $\delta_{j1}$ are both large (with the same sign) relative to the other coordinates, then $c_{ij}$ will be large as well, indicating a positive association between the $i$th row and the $j$th column category of the contingency table. If $\gamma_{i1}$ and $\delta_{j1}$ were both large with opposite signs, then there would be a negative association between the $i$th row and $j$th column.

In many applications, the first two eigenvalues, $\lambda_1$ and $\lambda_2$, dominate and the percentage of the total $\chi^2$ explained by the eigenvectors $\gamma_1$ and $\gamma_2$ and $\delta_1$ and $\delta_2$ is large. In this case (15.13) and $(\gamma_1, \gamma_2)$ can be used to obtain a graphical display of the $n$ rows of the table ($(\delta_1, \delta_2)$ play a similar role for the $p$ columns of the table). The proximity between row and column points is interpreted as above with respect to (15.10).

In correspondence analysis, we use the projections of weighted rows of $\mathcal{C}$ and the projections of weighted columns of $\mathcal{C}$ for graphical displays. Let $r_k$ $(n \times 1)$ be the projections of $\mathcal{A}^{-1/2}\mathcal{C}$ on $\delta_k$ and $s_k$ $(p \times 1)$ be the projections of $\mathcal{B}^{-1/2}\mathcal{C}^\top$ on $\gamma_k$ $(k = 1, \dots, R)$:

$$\begin{aligned}
r_k &= \mathcal{A}^{-1/2} \mathcal{C} \delta_k = \sqrt{\lambda_k}\, \mathcal{A}^{-1/2} \gamma_k,\\
s_k &= \mathcal{B}^{-1/2} \mathcal{C}^\top \gamma_k = \sqrt{\lambda_k}\, \mathcal{B}^{-1/2} \delta_k.
\end{aligned} \tag{15.16}$$
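These vectors have the property that

$$\begin{aligned}
r_k^\top a &= 0,\\
s_k^\top b &= 0.
\end{aligned} \tag{15.17}$$

The obtained projections on each axis $k = 1, \dots, R$ are centred at zero with the natural weights given by $a$ (the marginal frequencies of the rows of $\mathcal{X}$) for the row coordinates $r_k$ and by $b$ (the marginal frequencies of the columns of $\mathcal{X}$) for the column coordinates $s_k$ (compare this to expression (15.14)). As a result, the origin is the centre of gravity for all of the representations. We also know from (15.16) and the SVD of $\mathcal{C}$ that

$$\begin{aligned}
r_k^\top \mathcal{A} r_k &= \lambda_k,\\
s_k^\top \mathcal{B} s_k &= \lambda_k.
\end{aligned} \tag{15.18}$$

From the duality relation between $\delta_k$ and $\gamma_k$ (see (15.12)) we obtain

$$\begin{aligned}
r_k &= \frac{1}{\sqrt{\lambda_k}}\, \mathcal{A}^{-1/2} \mathcal{C} \mathcal{B}^{1/2} s_k,\\
s_k &= \frac{1}{\sqrt{\lambda_k}}\, \mathcal{B}^{-1/2} \mathcal{C}^\top \mathcal{A}^{1/2} r_k,
\end{aligned} \tag{15.19}$$

which can be simplified to

$$\begin{aligned}
r_k &= \sqrt{\frac{x_{\bullet\bullet}}{\lambda_k}}\, \mathcal{A}^{-1} \mathcal{X} s_k,\\
s_k &= \sqrt{\frac{x_{\bullet\bullet}}{\lambda_k}}\, \mathcal{B}^{-1} \mathcal{X}^\top r_k.
\end{aligned} \tag{15.20}$$

These vectors satisfy the relations (15.1) and (15.2) for each $k = 1, \dots, R$ simultaneously.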

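As in Chap. 10, the vectors $r_k$ and $s_k$ are referred to as factors (row factor and column factor respectively). They have the following means and variances:

$$\begin{aligned}
\bar r_k &= \frac{1}{x_{\bullet\bullet}}\, r_k^\top a = 0,\\
\bar s_k &= \frac{1}{x_{\bullet\bullet}}\, s_k^\top b = 0,
\end{aligned} \tag{15.21}$$

$$\begin{aligned}
\operatorname{Var}(r_k) &= \frac{1}{x_{\bullet\bullet}} \sum_{i=1}^{n} x_{i\bullet}\, r_{ki}^2 = \frac{r_k^\top \mathcal{A} r_k}{x_{\bullet\bullet}} = \frac{\lambda_k}{x_{\bullet\bullet}},\\
\operatorname{Var}(s_k) &= \frac{1}{x_{\bullet\bullet}} \sum_{j=1}^{p} x_{\bullet j}\, s_{kj}^2 = \frac{s_k^\top \mathcal{B} s_k}{x_{\bullet\bullet}} = \frac{\lambda_k}{x_{\bullet\bullet}}.
\end{aligned} \tag{15.22}$$

Hence, $\lambda_k / \sum_{j=1}^{R} \lambda_j$, which is the part of the $k$th factor in the decomposition of the $\chi^2$ statistic $t$, may also be interpreted as the proportion of the variance explained by the factor $k$. The proportions

$$C_a(i, r_k) = \frac{x_{i\bullet}\, r_{ki}^2}{\lambda_k}, \quad \text{for } i = 1, \dots, n,\ k = 1, \dots, R \tag{15.23}$$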
are called the absolute contributions of row $i$ to the variance of the factor $r_k$. They show which row categories are most important in the dispersion of the $k$th row factors. Similarly, the proportions

$$C_a(j, s_k) = \frac{x_{\bullet j}\, s_{kj}^2}{\lambda_k}, \quad \text{for } j = 1, \dots, p,\ k = 1, \dots, R \tag{15.24}$$

are called the absolute contributions of column $j$ to the variance of the column factor $s_k$. These absolute contributions may help to interpret the graph obtained by correspondence analysis.
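The row and column factors and their absolute contributions can be computed along the following lines. This Python sketch again uses the $(3 \times 3)$ table of Example 15.2; the variable names and the numerical tolerance used to discard zero eigenvalues are assumptions of the sketch, not part of the text.

```python
import numpy as np

x = np.array([[4, 0, 2],
              [0, 1, 1],
              [1, 1, 4]], dtype=float)

a = x.sum(axis=1)      # row margins x_{i.}
b = x.sum(axis=0)      # column margins x_{.j}
total = x.sum()        # x_{..}

expected = np.outer(a, b) / total
c = (x - expected) / np.sqrt(expected)

gamma, sing, delta_t = np.linalg.svd(c, full_matrices=False)
keep = sing > 1e-10                      # retain the R positive eigenvalues
gamma, delta = gamma[:, keep], delta_t.T[:, keep]
lam = sing[keep] ** 2

# Factors: r_k = A^{-1/2} C delta_k and s_k = B^{-1/2} C' gamma_k (columns k = 1..R).
r = (c @ delta) / np.sqrt(a)[:, None]
s = (c.T @ gamma) / np.sqrt(b)[:, None]

# Absolute contributions of rows and columns to each factor.
contrib_rows = a[:, None] * r ** 2 / lam
contrib_cols = b[:, None] * s ** 2 / lam

print(np.allclose(r.T @ a, 0.0), np.allclose(s.T @ b, 0.0))   # weighted means are zero
print(np.allclose((a[:, None] * r ** 2).sum(axis=0), lam))    # r_k' A r_k = lambda_k
print(np.allclose(r, np.sqrt(total / lam) * (x @ s) / a[:, None]))  # transition formula
print(contrib_rows.sum(axis=0), contrib_cols.sum(axis=0))     # each column sums to 1
```

Taking both sets of singular vectors from the same SVD keeps the signs of $r_k$ and $s_k$ consistent, which is what makes the transition formula hold without a sign correction.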


15.3  Correspondence  Analysis  in  Practice 

The graphical representations on the axes $k = 1, 2, \dots, R$ of the $n$ rows and of the $p$ columns of $\mathcal{X}$ are provided by the elements of $r_k$ and $s_k$. Two-dimensional displays are typically satisfactory if the cumulated percentage of variance explained by the first two factors, $\Psi_2 = (\lambda_1 + \lambda_2) / \sum_{k=1}^{R} \lambda_k$, is sufficiently large.

The  interpretation  of  the  graphs  may  be  summarised  as  follows: 

-  The  proximity  of  two  rows  (two  columns)  indicates  a  similar  profile  in  these 
two  rows  (two  columns),  where  “profile”  refers  to  the  conditional  frequency 
distribution  of  a  row  (column);  those  two  rows  (columns)  are  almost  proportional. 
The  opposite  interpretation  applies  when  the  two  rows  (two  columns)  are  far 
apart. 

-  The  proximity  of  a  particular  row  to  a  particular  column  indicates  that  this  row 
(column)  has  a  particularly  important  weight  in  this  column  (row).  In  contrast  to 
this,  a  row  that  is  quite  distant  from  a  particular  column  indicates  that  there  are 
almost  no  observations  in  this  column  for  this  row  (and  vice  versa).  Of  course,  as 
mentioned  above,  these  conclusions  are  particularly  true  when  the  points  are  far 
away  from  0. 

-  The  origin  is  the  average  of  the  factors  rk  and  sk .  Hence,  a  particular  point  (row 
or  column)  projected  close  to  the  origin  indicates  an  average  profile. 

-  The  absolute  contributions  are  used  to  evaluate  the  weight  of  each  row  (column) 
in  the  variances  of  the  factors. 

-  All  the  interpretations  outlined  above  must  be  carried  out  in  view  of  the  quality  of 
the  graphical  representation  which  is  evaluated,  as  in  PCA,  using  the  cumulated 
percentage  of  variance. 

Remark  15.1  Note  that  correspondence  analysis  can  also  be  applied  to  more  general 
(n  x  p)  tables  X  which  in  a  “strict  sense”  are  not  contingency  tables. 

As long as statistical (or natural) meaning can be given to sums over rows and columns, Remark 15.1 holds. This implies, in particular, that all of the variables are measured in the same units. In that case, $x_{\bullet\bullet}$ constitutes the total frequency of the observed phenomenon, and is shared between individuals ($n$ rows) and between variables ($p$ columns). Representations of the rows and columns of $\mathcal{X}$, $r_k$ and $s_k$, have the basic property (15.19) and show which variables have important weights for each individual and vice versa. This type of analysis is used as an alternative to PCA. PCA is mainly concerned with covariances and correlations, whereas correspondence analysis analyses a more general kind of association. (See Exercises 15.3 and 15.11.)

Example 15.3 A survey of Belgian citizens who regularly read a newspaper was conducted in the 1980s. They were asked where they lived. The possible answers were ten regions: seven provinces (Antwerp, Western Flanders, Eastern Flanders, Hainaut, Liege, Limbourg, Luxembourg) and three regions around Brussels (Flemish-Brabant, Wallon-Brabant and the city of Brussels). They were also asked what kind of newspapers they read on a regular basis. There were 15 possible answers split up into three classes: Flemish newspapers (label begins with the letter v), French newspapers (label begins with f) and both languages together (label begins with b). The data set is given in Table 22.9. The eigenvalues of the factorial correspondence analysis are given in Table 15.1.

Two-dimensional representations will be quite satisfactory since the first two eigenvalues account for 81 % of the variance. Figure 15.1 shows the projections of the rows (the 15 newspapers) and of the columns (the ten regions).
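The percentages of variance follow from the eigenvalues reported in Table 15.1 by simple division. A short Python sketch of this calculation is given below; because the eigenvalues are entered as rounded in the table, the computed shares can differ from the printed percentages in the last digit.

```python
# Eigenvalues as printed in Table 15.1 (rounded to two decimals).
eigenvalues = [183.40, 43.75, 25.21, 11.74, 8.04, 4.68, 2.13, 1.20, 0.82, 0.00]

total = sum(eigenvalues)
share = [lam / total for lam in eigenvalues]

# Cumulated percentage of variance explained by the first k factors.
cumulated = []
running = 0.0
for p in share:
    running += p
    cumulated.append(running)

for lam, p, c in zip(eigenvalues, share, cumulated):
    print(f"{lam:8.2f}  {p:.3f}  {c:.3f}")
```

The second entry of `cumulated` is the share explained by the first two factors, roughly 81 %.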

As  expected,  there  is  a  high  association  between  the  regions  and  the  type 
of  newspapers  which  is  read.  In  particular,  Vb  (Gazet  van  Antwerp)  is  almost 
exclusively  read  in  the  province  of  Antwerp  (this  is  an  extreme  point  in  the  graph). 
The  points  on  the  left  all  belong  to  Flanders,  whereas  those  on  the  right  all  belong  to 
Wallonia.  Notice  that  the  Wallon-Brabant  and  the  Flemish-Brabant  are  not  far  from 
Brussels.  Brussels  is  close  to  the  centre  (average)  and  also  close  to  the  bilingual 
newspapers.  It  is  shifted  a  little  to  the  right  of  the  origin  due  to  the  majority  of 
French  speaking  people  in  the  area. 

The absolute contributions of the first three factors are listed in Tables 15.2
and 15.3. The row factors $r_k$ are in Table 15.2 and the column factors $s_k$ are in
Table 15.3.


Table 15.1 Eigenvalues and percentages of the variance (Example 15.3)

Eigenvalue $\lambda_j$   Percentage of variance   Cumulated percentage
   183.40                  0.653                    0.653
    43.75                  0.156                    0.809
    25.21                  0.090                    0.898
    11.74                  0.042                    0.940
     8.04                  0.029                    0.969
     4.68                  0.017                    0.985
     2.13                  0.008                    0.993
     1.20                  0.004                    0.997
     0.82                  0.003                    1.000
     0.00                  0.000                    1.000

Fig. 15.1 Projection of rows (the 15 newspapers) and columns (the ten regions) Q MVAcorrjourn


Table 15.2 Absolute contributions of row factors $r_k$

         $C_a(i, r_1)$   $C_a(i, r_2)$   $C_a(i, r_3)$
$v_a$       0.0563          0.0008          0.0036
$v_b$       0.1555          0.5567          0.0067
$v_c$       0.0244          0.1179          0.0266
$v_d$       0.1352          0.0952          0.0164
$v_e$       0.0253          0.1193          0.0013
$f_f$       0.0314          0.0183          0.0597
$f_g$       0.0585          0.0162          0.0122
$f_h$       0.1086          0.0024          0.0656
$f_i$       0.1001          0.0024          0.6376
$b_j$       0.0029          0.0055          0.0187
$b_k$       0.0236          0.0278          0.0237
$b_l$       0.0006          0.0090          0.0064
$v_m$       0.1000          0.0038          0.0047
$f_n$       0.0966          0.0059          0.0269
$f_o$       0.0810          0.0188          0.0899
Total       1.0000          1.0000          1.0000

Tables 15.2 and 15.3 show, for instance, the important role of Antwerp and the
newspaper $v_b$ in determining the variance of both factors. Clearly, the first axis
expresses linguistic differences between the three parts of Belgium. The second axis
shows a larger dispersion among the Flemish regions than among the French-speaking
regions.




Table 15.3 Absolute contributions of column factors $s_k$

         $C_a(j, s_1)$   $C_a(j, s_2)$   $C_a(j, s_3)$
brw         0.0887          0.0210          0.2860
bxl         0.1259          0.0010          0.0960
anv         0.2999          0.4349          0.0029
brf         0.0064          0.2370          0.0090
foc         0.0729          0.1409          0.0033
for         0.0998          0.0023          0.0079
hai         0.1046          0.0012          0.3141
lig         0.1168          0.0355          0.1025
lim         0.0562          0.1162          0.0027
lux         0.0288          0.0101          0.1761
Total       1.0000          1.0000          1.0000

Fig. 15.2 Correspondence analysis including Corsica Q MVAcorrbac


Note also that the third axis shows an important role of the category “$f_i$” (other
French newspapers) with the Wallon-Brabant “brw” and the Hainaut “hai” showing
the most important contributions. The coordinate of “$f_i$” on this axis is negative
(not shown here), as are the coordinates of “brw” and “hai”. Apparently, these
two regions also seem to feature a greater proportion of readers of more local
newspapers.

Example 15.4 Applying correspondence analysis to the French baccalaureat data
(Table 22.8) leads to Fig. 15.2. Excluding Corsica we obtain Fig. 15.3. The different
modalities are labeled A, ..., H and the regions are labeled ILDF, ..., CORS.
The results of the correspondence analysis are given in Table 15.4 and Fig. 15.2.

Fig. 15.3 Correspondence analysis excluding Corsica Q MVAcorrbac

Table 15.4 Eigenvalues and percentages of explained variance (including Corsica)

Eigenvalues $\lambda$   Percentage of variances   Cumulated percentage
   2,436.2                0.5605                    0.561
   1,052.4                0.2421                    0.803
     341.8                0.0786                    0.881
     229.5                0.0528                    0.934
     152.2                0.0350                    0.969
     109.1                0.0251                    0.994
      25.0                0.0058                    1.000
       0.0                0.0000                    1.000

The  first  two  factors  explain  80  %  of  the  total  variance.  It  is  clear  from  Fig.  15.2 
that  Corsica  (in  the  upper  left)  is  an  outlier.  The  analysis  is  therefore  redone  without 
Corsica  and  the  results  are  given  in  Table  15.5  and  Fig.  15.3.  Since  Corsica  has  such 
a  small  weight  in  the  analysis,  the  results  have  not  changed  much. 

The  projections  on  the  first  three  axes,  along  with  their  absolute  contribution 
to  the  variance  of  the  axis,  are  summarised  in  Table  15.6  for  the  regions  and  in 
Table  15.7  for  baccalaureats. 

Table 15.5 Eigenvalues and percentages of explained variance (excluding Corsica)

Eigenvalues $\lambda$   Percentage of variances   Cumulated percentage
   2,408.6                0.5874                    0.587
     909.5                0.2218                    0.809
     318.5                0.0777                    0.887
     195.9                0.0478                    0.935
     149.3                0.0364                    0.971
      96.1                0.0234                    0.994
      22.8                0.0056                    1.000
       0.0                0.0000                    1.000

Table 15.6 Coefficients and absolute contributions for regions, Example 15.4

Region    $r_1$     $r_2$     $r_3$    $C_a(i,r_1)$  $C_a(i,r_2)$  $C_a(i,r_3)$
ILDF     0.1464    0.0677    0.0157     0.3839        0.2175        0.0333
CHAM    -0.0603   -0.0410   -0.0187     0.0064        0.0078        0.0047
PICA     0.0323   -0.0258   -0.0318     0.0021        0.0036        0.0155
HNOR    -0.0692    0.0287    0.1156     0.0096        0.0044        0.2035
CENT    -0.0068   -0.0205   -0.0145     0.0001        0.0030        0.0043
BNOR    -0.0271   -0.0762    0.0061     0.0014        0.0284        0.0005
BOUR    -0.1921    0.0188    0.0578     0.0920        0.0023        0.0630
NOPC    -0.1278    0.0863   -0.0570     0.0871        0.1052        0.1311
LORR    -0.2084    0.0511    0.0467     0.1606        0.0256        0.0608
ALSA    -0.2331    0.0838    0.0655     0.1283        0.0439        0.0767
FRAC    -0.1304   -0.0368   -0.0444     0.0265        0.0056        0.0232
PAYL    -0.0743   -0.0816   -0.0341     0.0232        0.0743        0.0370
BRET     0.0158    0.0249   -0.0469     0.0011        0.0070        0.0708
PCHA    -0.0610   -0.1391   -0.0178     0.0085        0.1171        0.0054
AQUI     0.0368   -0.1183    0.0455     0.0055        0.1519        0.0643
MIDI     0.0208   -0.0567    0.0138     0.0018        0.0359        0.0061
LIMO    -0.0540    0.0221   -0.0427     0.0033        0.0014        0.0154
RHOA    -0.0225    0.0273   -0.0385     0.0042        0.0161        0.0918
AUVE     0.0290   -0.0139   -0.0554     0.0017        0.0010        0.0469
LARO     0.0290   -0.0862   -0.0177     0.0383        0.0595        0.0072
PROV     0.0469   -0.0717    0.0279     0.0142        0.0884        0.0383

Table 15.7 Coefficients and absolute contributions for baccalaureats, Example 15.4

Baccal    $s_1$     $s_2$     $s_3$    $C_a(j,s_1)$  $C_a(j,s_2)$  $C_a(j,s_3)$
A        0.0447   -0.0679    0.0367     0.0376        0.2292        0.1916
B        0.1389    0.0557    0.0011     0.1724        0.0735        0.0001
C        0.0940    0.0995    0.0079     0.1198        0.3556        0.0064
D        0.0227   -0.0495   -0.0530     0.0098        0.1237        0.4040
E       -0.1932    0.0492   -0.1317     0.0825        0.0141        0.2900
F       -0.2156    0.0862    0.0188     0.3793        0.1608        0.0219
G       -0.1244   -0.0353    0.0279     0.1969        0.0421        0.0749
H       -0.0945    0.0438   -0.0888     0.0017        0.0010        0.0112

Table 15.8 Eigenvalues and explained proportion of variance, Example 15.5

Eigenvalue $\lambda_j$   Percentage of variance   Cumulated percentage
   4,399.0                 0.4914                   0.4914
   2,213.6                 0.2473                   0.7387
   1,382.4                 0.1544                   0.8932
     870.7                 0.0973                   0.9904
      51.0                 0.0057                   0.9961
      34.8                 0.0039                   1.0000
       0.0                 0.0000                   1.0000

The interpretation of the results may be summarised as follows. Table 15.7 shows
that the baccalaureats B on one side and F on the other side are most strongly
responsible for the variation on the first axis. The second axis mostly characterises
an opposition between baccalaureats A and C. Regarding the regions, Île de France
plays an important role on each axis. On the first axis, it is opposed to Lorraine
and Alsace, whereas on the second axis, it is opposed to Poitou-Charentes and
Aquitaine. All of this is confirmed in Fig. 15.3.

On the right side are the more classical baccalaureats and on the left, more
technical ones. The regions on the left side have thus larger weights in the technical
baccalaureats. Note also that most of the southern regions of France are concentrated
in the lower part of the graph near the baccalaureat A.

Finally,  looking  at  the  third  axis,  we  see  that  it  is  dominated  by  the  baccalaureat 
E  (negative  sign)  and  to  a  lesser  degree  by  H  (negative)  (as  opposed  to  A  (positive 
sign)).  The  dominating  regions  are  HNOR  (positive  sign),  opposed  to  NOPC  and 
AUVE  (negative  sign).  For  instance,  HNOR  is  particularly  poor  in  baccalaureat  D. 

Example  15.5  The  US  crime  data  set  (Table  22.10)  gives  the  number  of  crimes  in 
the  50  states  of  the  US  classified  in  1985  for  each  of  the  following  seven  categories: 
murder,  rape,  robbery,  assault,  burglary,  larceny  and  auto-theft.  The  analysis  of  the 
contingency  table,  limited  to  the  first  two  factors,  provides  the  following  results  (see 
Table  15.8). 

Looking at the absolute contributions (not reproduced here, see Exercise 15.6), it
appears that the first axis is a robbery (+) versus larceny (−) and auto-theft (−) axis
and that the second factor contrasts assault (−) to auto-theft (+). The dominating
states for the first axis are the North-Eastern States MA (+) and NY (+) contrasting
the Western States WY (−) and ID (−). For the second axis, the differences are
seen between the Northern States (MA (+) and RI (+)) and the Southern States
AL (−), MS (−) and AR (−). These results can be clearly seen in Fig. 15.4 where
all the states and crimes are reported. The figure also shows in which states the
proportion of a particular crime category is higher or lower than the national average
(the origin).

Fig. 15.4 Projection of rows (the 50 states) and columns (the seven crime categories) Q MVAcorrcrime


Biplots 


The biplot is a low-dimensional display of a data matrix $\mathcal{X}$ where the rows and
columns are represented by points. The interpretation of a biplot is specifically
directed towards the scalar products of lower dimensional factorial variables and
is designed to approximately recover the individual elements of the data matrix in
these scalar products. Suppose that we have a (10 × 5) data matrix with elements $x_{ij}$.
The idea of the biplot is to find 10 row points $q_i \in \mathbb{R}^k$ ($k < p$, $i = 1, \ldots, 10$) and
5 column points $t_j \in \mathbb{R}^k$ ($j = 1, \ldots, 5$) such that the 50 scalar products between
the row and the column vectors closely approximate the 50 corresponding elements
of the data matrix $\mathcal{X}$. Usually we choose $k = 2$. For example, the scalar product
between $q_7$ and $t_4$ should approximate the data value $x_{74}$ in the seventh row and
the fourth column. In general, the biplot models the data $x_{ij}$ as the sum of a scalar
product in some low-dimensional subspace and a residual “error” term:

$$x_{ij} = q_i^{\top} t_j + e_{ij} = \sum_{k} q_{ik} t_{jk} + e_{ij}. \qquad (15.25)$$

To understand the link between correspondence analysis and the biplot, we need to
introduce a formula which expresses $x_{ij}$ from the original data matrix (see (15.3)) in
terms of row and column frequencies. One such formula, known as the “reconstitution
formula”, is (15.10):

$$x_{ij} = E_{ij}\left(1 + \frac{\sum_{k=1}^{R} \lambda_k^{1/2} \gamma_{ik} \delta_{jk}}{\sqrt{x_{i\bullet} x_{\bullet j} / x_{\bullet\bullet}}}\right). \qquad (15.26)$$

Consider now the row profiles $x_{ij}/x_{i\bullet}$ (the conditional frequencies) and the average
row profile $x_{\bullet j}/x_{\bullet\bullet}$. From (15.26) we obtain the difference between each row profile
and this average:

$$\frac{x_{ij}}{x_{i\bullet}} - \frac{x_{\bullet j}}{x_{\bullet\bullet}} = \sum_{k=1}^{R} \lambda_k^{1/2} \gamma_{ik} \sqrt{\frac{x_{\bullet j}}{x_{i\bullet} x_{\bullet\bullet}}}\; \delta_{jk}. \qquad (15.27)$$

By the same argument we can also obtain the difference between each column
profile and the average column profile:

$$\frac{x_{ij}}{x_{\bullet j}} - \frac{x_{i\bullet}}{x_{\bullet\bullet}} = \sum_{k=1}^{R} \lambda_k^{1/2} \gamma_{ik} \sqrt{\frac{x_{i\bullet}}{x_{\bullet j} x_{\bullet\bullet}}}\; \delta_{jk}. \qquad (15.28)$$

Now, if $\lambda_1 \gg \lambda_2 \gg \lambda_3 \ldots$, we can approximate these sums by a finite number of
$K$ terms (usually $K = 2$) using (15.16) to obtain

$$\frac{x_{ij}}{x_{\bullet j}} - \frac{x_{i\bullet}}{x_{\bullet\bullet}} = \sum_{k=1}^{K} \left(\frac{x_{i\bullet}}{\sqrt{\lambda_k x_{\bullet\bullet}}}\, r_{ki}\right) s_{kj} + e_{ij}, \qquad (15.29)$$

$$\frac{x_{ij}}{x_{i\bullet}} - \frac{x_{\bullet j}}{x_{\bullet\bullet}} = \sum_{k=1}^{K} \left(\frac{x_{\bullet j}}{\sqrt{\lambda_k x_{\bullet\bullet}}}\, s_{kj}\right) r_{ki} + e'_{ij}, \qquad (15.30)$$

where $e_{ij}$ and $e'_{ij}$ are error terms. Equation (15.30) shows that if we consider
displaying the differences between the row profiles and the average profile, then
the projection of the row profile $r_k$ and a rescaled version of the projections of the
column profile $s_k$ constitute a biplot of these differences. Equation (15.29) implies
the same for the differences between the column profiles and this average.
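
To make the biplot model (15.25) concrete, the following sketch builds row points $q_i$ and column points $t_j$ for $k = 2$ and checks that their scalar products recover a small data matrix. It is an illustration under stated assumptions: the (4 × 3) matrix is made up (it has rank 2, so the error terms vanish), the function names are our own, and the two leading singular triplets are found by plain power iteration with deflation instead of a library SVD routine.

```python
import math


def matvec(M, v):
    """Matrix-vector product M v for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def leading_triplet(M, iters=500):
    """Leading singular triplet (sigma, u, v) of M via power iteration on M^T M."""
    Mt = transpose(M)
    v = [1.0 / (j + 1) for j in range(len(M[0]))]  # arbitrary start vector
    for _ in range(iters):
        w = matvec(Mt, matvec(M, v))
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0, [0.0] * len(M), v
        v = [x / norm for x in w]
    Mv = matvec(M, v)
    sigma = math.sqrt(sum(x * x for x in Mv))
    u = [x / sigma for x in Mv] if sigma > 0.0 else [0.0] * len(M)
    return sigma, u, v


def biplot_points(X, k=2):
    """Row points q_i and column points t_j in R^k with x_ij ~ q_i^T t_j."""
    R = [row[:] for row in X]          # residual matrix, deflated step by step
    Q = [[] for _ in X]
    T = [[] for _ in X[0]]
    for _ in range(k):
        sigma, u, v = leading_triplet(R)
        for i, ui in enumerate(u):
            Q[i].append(sigma * ui)    # singular value absorbed into the row points
        for j, vj in enumerate(v):
            T[j].append(vj)
        R = [[R[i][j] - sigma * u[i] * v[j] for j in range(len(v))]
             for i in range(len(u))]
    return Q, T


# Made-up rank-2 data matrix (4 rows, 3 columns).
X = [[1.0, 2.0, 2.0],
     [2.0, -2.0, 1.0],
     [3.0, 2.0, 4.0],
     [4.0, -2.0, 3.0]]
Q, T = biplot_points(X, k=2)
approx = [[sum(q * t for q, t in zip(Q[i], T[j])) for j in range(len(T))]
          for i in range(len(Q))]
max_err = max(abs(X[i][j] - approx[i][j])
              for i in range(len(X)) for j in range(len(X[0])))
print(max_err)
```

Putting the whole singular value into the row points is only one possible scaling; any split of the singular values between $q_i$ and $t_j$ leaves the scalar products, and hence the approximation, unchanged.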
Summary

↪ Correspondence analysis is a factorial decomposition of contingency
tables. The p-dimensional individuals and the n-dimensional
variables can be graphically represented by projecting onto spaces
of smaller dimension.

↪ The practical computation consists of first computing a spectral
decomposition of $\mathcal{A}^{-1}\mathcal{X}\mathcal{B}^{-1}\mathcal{X}^{\top}$ and $\mathcal{B}^{-1}\mathcal{X}^{\top}\mathcal{A}^{-1}\mathcal{X}$ which have the
same first p eigenvalues. The graphical representation is obtained
by plotting $\sqrt{\lambda_1}\, r_1$ vs. $\sqrt{\lambda_2}\, r_2$ and $\sqrt{\lambda_1}\, s_1$ vs. $\sqrt{\lambda_2}\, s_2$. Both plots
may be displayed in the same graph taking into account the appropriate
orientation of the eigenvectors $r_i$, $s_j$.

↪ Correspondence analysis provides a graphical display of the association
measure $c_{ij} = (x_{ij} - E_{ij})^2 / E_{ij}$.

↪ Biplot is a low-dimensional display of a data matrix where the rows
and columns are represented by points.
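
The association measure named in the summary can be computed directly from a contingency table. The sketch below is illustrative Python with helper names of our own and a made-up (2 × 2) table; it forms the expected counts $E_{ij} = x_{i\bullet} x_{\bullet j} / x_{\bullet\bullet}$ under independence and sums the cell contributions to the $\chi^2$-statistic. It assumes that every row and column total is positive.

```python
def expected_counts(X):
    """E_ij = x_i. * x_.j / x_.. under independence of rows and columns."""
    row_totals = [sum(row) for row in X]
    col_totals = [sum(col) for col in zip(*X)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def association(X):
    """Cell-wise contributions (x_ij - E_ij)^2 / E_ij."""
    E = expected_counts(X)
    return [[(x - e) ** 2 / e for x, e in zip(x_row, e_row)]
            for x_row, e_row in zip(X, E)]


# Illustrative table, not one of the book's data sets.
X = [[3, 2],
     [1, 4]]
contributions = association(X)
chi2 = sum(sum(row) for row in contributions)
print(contributions)
print(chi2)
```

Cells with a large contribution are those that depart most from independence, and these are the associations that the correspondence analysis display brings out.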
Exercise 15.1 Show that the matrices $\mathcal{A}^{-1}\mathcal{X}\mathcal{B}^{-1}\mathcal{X}^{\top}$ and $\mathcal{B}^{-1}\mathcal{X}^{\top}\mathcal{A}^{-1}\mathcal{X}$ have an
eigenvalue equal to 1 and that the corresponding eigenvectors are proportional to
$(1, \ldots, 1)^{\top}$.
Exercise  15.2  Verify  the  relations  in  (15.8),  (15.14)  and  (15.17). 


Exercise 15.3 Do a correspondence analysis for the car marks data (Table 22.7).
Explain how this table can be considered as a contingency table.

Exercise 15.4 Compute the $\chi^2$-statistic of independence for the French baccalaureat
data.

Exercise 15.5 Prove that $\mathcal{C} = \mathcal{A}^{-1/2}(\mathcal{X} - E)\mathcal{B}^{-1/2}\sqrt{x_{\bullet\bullet}}$ and $E = \frac{a b^{\top}}{x_{\bullet\bullet}}$ and
verify (15.20).


Exercise 15.6 Do the full correspondence analysis of the US crime data
(Table 22.10), and determine the absolute contributions for the first three axes.
How can you interpret the third axis? Try to identify each state with the one of the four
regions to which it belongs. Do you think the four regions behave differently
with respect to crime?

Exercise 15.7 Repeat Exercise 15.6 with the US health data (Table 22.16). Only
analyse the columns indicating the number of deaths per state.

Exercise 15.8 Consider a $(n \times n)$ contingency table being a diagonal matrix $\mathcal{X}$.
What do you expect the factors $r_k$, $s_k$ to be like?

Exercise 15.9 Assume that after some reordering of the rows and the columns, the
contingency table has the following structure:

$$\mathcal{X} = \begin{array}{c|cc} & J_1 & J_2 \\ \hline I_1 & * & 0 \\ I_2 & 0 & * \end{array}$$

That is, the rows $I_i$ only have weights in the columns $J_i$, for $i = 1, 2$. What do you
expect the graph of the first two factors to look like?

Exercise 15.10 Redo Exercise 15.9 using the following contingency table:

$$\mathcal{X} = \begin{array}{c|ccc} & J_1 & J_2 & J_3 \\ \hline I_1 & * & 0 & 0 \\ I_2 & 0 & * & 0 \\ I_3 & 0 & 0 & * \end{array}$$

Exercise 15.11 Consider the French food data (Table 22.6). Given that all of the
variables are measured in the same units (Francs), explain how this table can be
considered as a contingency table. Perform a correspondence analysis and compare
the results to those obtained in the NPCA analysis in Chap. 11.


Chapter  16 

Canonical  Correlation  Analysis 


Complex multivariate data structures are better understood by studying low-dimensional
projections. For a joint study of two data sets, we may ask what type
of  low-dimensional  projection  helps  in  finding  possible  joint  structures  for  the  two 
samples.  The  canonical  correlation  analysis  (CCA)  is  a  standard  tool  of  multivariate 
statistical  analysis  for  discovery  and  quantification  of  associations  between  two  sets 
of  variables. 

The  basic  technique  is  based  on  projections.  One  defines  an  index  (projected 
multivariate  variable)  that  maximally  correlates  with  the  index  of  the  other  variable 
for  each  sample  separately.  The  aim  of  CCA  is  to  maximise  the  association 
(measured  by  correlation)  between  the  low-dimensional  projections  of  the  two  data 
sets.  The  canonical  correlation  vectors  are  found  by  a  joint  covariance  analysis 
of  the  two  variables.  The  technique  is  applied  to  a  marketing  example  where  the 
association  of  a  price  factor  and  other  variables  (like  design,  sportiness  etc.)  is 
analysed.  Tests  are  given  on  how  to  evaluate  the  significance  of  the  discovered 
association. 


16.1  Most  Interesting  Linear  Combination 

The  associations  between  two  sets  of  variables  may  be  identified  and  quantified  by 
CCA.  The  technique  was  originally  developed  by  Hotelling  (1935)  who  analysed 
how  arithmetic  speed  and  arithmetic  power  are  related  to  reading  speed  and  reading 
power.  Other  examples  are  the  relation  between  governmental  policy  variables 
and  economic  performance  variables  and  the  relation  between  job  and  company 
characteristics. 
Suppose we are given two random variables $X \in \mathbb{R}^q$ and $Y \in \mathbb{R}^p$. The idea is to
find an index describing a (possible) link between $X$ and $Y$. CCA is based on linear
indices, i.e. linear combinations

$$a^{\top}X \quad \text{and} \quad b^{\top}Y$$

of the random variables. CCA searches for vectors $a$ and $b$ such that the relation
of the two indices $a^{\top}x$ and $b^{\top}y$ is quantified in some interpretable way. More
precisely, one is looking for the “most interesting” projections $a$ and $b$ in the sense
that they maximise the correlation

$$\rho(a,b) = \rho_{a^{\top}X\; b^{\top}Y} \qquad (16.1)$$

between the two indices.

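Let us consider the correlation $\rho(a,b)$ between the two projections in more detail.
Suppose that

$$\begin{pmatrix} X \\ Y \end{pmatrix} \sim \left( \begin{pmatrix} \mu \\ \nu \end{pmatrix}, \begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{pmatrix} \right)$$

where the sub-matrices of this covariance structure are given by

$$\begin{aligned}
\mathrm{Var}(X) &= \Sigma_{XX} \quad (q \times q) \\
\mathrm{Var}(Y) &= \Sigma_{YY} \quad (p \times p) \\
\mathrm{Cov}(X,Y) &= \mathrm{E}(X-\mu)(Y-\nu)^{\top} = \Sigma_{XY} = \Sigma_{YX}^{\top} \quad (q \times p).
\end{aligned}$$

Using (3.7) and (4.26),

$$\rho(a,b) = \frac{a^{\top}\Sigma_{XY}\,b}{(a^{\top}\Sigma_{XX}\,a)^{1/2}\,(b^{\top}\Sigma_{YY}\,b)^{1/2}}. \qquad (16.2)$$

Therefore, $\rho(ca,b) = \rho(a,b)$ for any $c \in \mathbb{R}^{+}$. Given the invariance of scale we
may rescale projections $a$ and $b$ and thus we can equally solve

$$\max_{a,b}\; a^{\top}\Sigma_{XY}\,b$$

under the constraints

$$a^{\top}\Sigma_{XX}\,a = 1, \qquad b^{\top}\Sigma_{YY}\,b = 1.$$

For this problem, define

$$\mathcal{K} = \Sigma_{XX}^{-1/2}\,\Sigma_{XY}\,\Sigma_{YY}^{-1/2}. \qquad (16.3)$$

Recall the singular value decomposition of $\mathcal{K}$ $(q \times p)$ from Theorem 2.2. The matrix
$\mathcal{K}$ may be decomposed as

$$\mathcal{K} = \Gamma \Lambda \Delta^{\top}$$

with

$$\begin{aligned}
\Gamma &= (\gamma_1, \ldots, \gamma_k) \\
\Delta &= (\delta_1, \ldots, \delta_k) \\
\Lambda &= \mathrm{diag}(\lambda_1^{1/2}, \ldots, \lambda_k^{1/2})
\end{aligned} \qquad (16.4)$$

where by (16.3) and (2.15),

$$k = \mathrm{rank}(\mathcal{K}) = \mathrm{rank}(\Sigma_{XY}) = \mathrm{rank}(\Sigma_{YX}),$$

and $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k$ are the nonzero eigenvalues of $\mathcal{N}_1 = \mathcal{K}\mathcal{K}^{\top}$ and $\mathcal{N}_2 = \mathcal{K}^{\top}\mathcal{K}$
and $\gamma_i$ and $\delta_j$ are the standardised eigenvectors of $\mathcal{N}_1$ and $\mathcal{N}_2$ respectively.

Define now for $i = 1, \ldots, k$ the vectors

$$a_i = \Sigma_{XX}^{-1/2}\,\gamma_i, \qquad (16.5)$$

$$b_i = \Sigma_{YY}^{-1/2}\,\delta_i, \qquad (16.6)$$

which are called the canonical correlation vectors. Using these canonical correlation
vectors we define the canonical correlation variables

$$\eta_i = a_i^{\top}X, \qquad (16.7)$$

$$\varphi_i = b_i^{\top}Y. \qquad (16.8)$$

The quantities $\rho_i = \lambda_i^{1/2}$ for $i = 1, \ldots, k$ are called the canonical correlation
coefficients.

From the properties of the singular value decomposition given in (16.4) we have

$$\mathrm{Cov}(\eta_i, \eta_j) = a_i^{\top}\Sigma_{XX}\,a_j = \gamma_i^{\top}\gamma_j = \begin{cases} 1 & i = j, \\ 0 & i \ne j. \end{cases} \qquad (16.9)$$

The same is true for $\mathrm{Cov}(\varphi_i, \varphi_j)$. The following theorem tells us that the canonical
correlation vectors are the solution to the maximisation problem of (16.1).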
Theorem 16.1 For any given $r$, $1 \le r \le k$, the maximum

$$C(r) = \max_{a,b}\; a^{\top}\Sigma_{XY}\,b \qquad (16.10)$$

subject to

$$a^{\top}\Sigma_{XX}\,a = 1, \qquad b^{\top}\Sigma_{YY}\,b = 1$$

and

$$a_i^{\top}\Sigma_{XX}\,a = 0 \quad \text{for } i = 1, \ldots, r-1$$

is given by

$$C(r) = \rho_r = \lambda_r^{1/2}$$

and is attained when $a = a_r$ and $b = b_r$.

Proof The proof is given in three steps.

(i) Fix $a$ and maximise over $b$, i.e. solve:

$$\max_{b}\; (a^{\top}\Sigma_{XY}\,b)^2 = \max_{b}\; (b^{\top}\Sigma_{YX}\,a)(a^{\top}\Sigma_{XY}\,b)$$

subject to $b^{\top}\Sigma_{YY}\,b = 1$. By Theorem 2.5 the maximum is given by the largest
eigenvalue of the matrix

$$\Sigma_{YY}^{-1}\,\Sigma_{YX}\,a\,a^{\top}\Sigma_{XY}.$$

By Corollary 2.2, the only nonzero eigenvalue equals

$$a^{\top}\Sigma_{XY}\,\Sigma_{YY}^{-1}\,\Sigma_{YX}\,a. \qquad (16.11)$$

(ii) Maximise (16.11) over $a$ subject to the constraints of the theorem. Put
$\gamma = \Sigma_{XX}^{1/2}\,a$ and observe that (16.11) equals

$$\gamma^{\top}\Sigma_{XX}^{-1/2}\,\Sigma_{XY}\,\Sigma_{YY}^{-1}\,\Sigma_{YX}\,\Sigma_{XX}^{-1/2}\,\gamma = \gamma^{\top}\mathcal{K}\mathcal{K}^{\top}\gamma.$$

Thus, solve the equivalent problem

$$\max_{\gamma}\; \gamma^{\top}\mathcal{N}_1\,\gamma \qquad (16.12)$$

subject to $\gamma^{\top}\gamma = 1$, $\gamma_i^{\top}\gamma = 0$ for $i = 1, \ldots, r-1$.
Note that the $\gamma_i$'s are the eigenvectors of $\mathcal{N}_1$ corresponding to its first $r-1$
largest eigenvalues. Thus, as in Theorem 11.3, the maximum in (16.12) is
obtained by setting $\gamma$ equal to the eigenvector corresponding to the $r$th largest
eigenvalue, i.e. $\gamma = \gamma_r$ or equivalently $a = a_r$. This yields

$$C^2(r) = \gamma_r^{\top}\mathcal{N}_1\,\gamma_r = \lambda_r\,\gamma_r^{\top}\gamma_r = \lambda_r.$$

(iii) Show that the maximum is attained for $a = a_r$ and $b = b_r$. From the SVD of
$\mathcal{K}$ we conclude that $\mathcal{K}\delta_r = \rho_r\gamma_r$ and hence

$$a_r^{\top}\Sigma_{XY}\,b_r = \gamma_r^{\top}\mathcal{K}\,\delta_r = \rho_r\,\gamma_r^{\top}\gamma_r = \rho_r.$$

□


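Let

$$\begin{pmatrix} X \\ Y \end{pmatrix} \sim \left( \begin{pmatrix} \mu \\ \nu \end{pmatrix}, \begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{pmatrix} \right).$$

The canonical correlation vectors

$$a_1 = \Sigma_{XX}^{-1/2}\,\gamma_1, \qquad b_1 = \Sigma_{YY}^{-1/2}\,\delta_1$$

maximise the correlation between the canonical variables

$$\eta_1 = a_1^{\top}X, \qquad \varphi_1 = b_1^{\top}Y.$$

The covariance of the canonical variables $\eta$ and $\varphi$ is given in the next theorem.

Theorem 16.2 Let $\eta_i$ and $\varphi_i$ be the $i$th canonical correlation variables
($i = 1, \ldots, k$). Define $\eta = (\eta_1, \ldots, \eta_k)$ and $\varphi = (\varphi_1, \ldots, \varphi_k)$. Then

$$\mathrm{Var}\begin{pmatrix} \eta \\ \varphi \end{pmatrix} = \begin{pmatrix} \mathcal{I}_k & \Lambda \\ \Lambda & \mathcal{I}_k \end{pmatrix}$$

with $\Lambda$ given in (16.4).

This theorem shows that the canonical correlation coefficients, $\rho_i = \lambda_i^{1/2}$, are
the covariances between the canonical variables $\eta_i$ and $\varphi_i$ and that the indices
$\eta_1 = a_1^{\top}X$ and $\varphi_1 = b_1^{\top}Y$ have the maximum covariance $\sqrt{\lambda_1} = \rho_1$.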
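
A small numerical sketch may help to see Theorems 16.1 and 16.2 at work. It is illustrative Python under stated assumptions: the (2 × 2) covariance blocks are made up (not data from the book), all function names are our own, and instead of the SVD of $\mathcal{K}$ it solves the equivalent eigenproblem $\Sigma_{XX}^{-1}\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}\,a = \lambda a$, which follows from $\mathcal{K}\mathcal{K}^{\top}\gamma = \lambda\gamma$ with $\gamma = \Sigma_{XX}^{1/2}a$.

```python
import math


def inv2(M):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(M):
    return [list(col) for col in zip(*M)]


def quad(u, M, v):
    """Bilinear form u^T M v."""
    return sum(u[i] * M[i][j] * v[j]
               for i in range(len(u)) for j in range(len(v)))


def rho(a, b, Sxx, Sxy, Syy):
    """Correlation (16.2) between the indices a^T X and b^T Y."""
    return quad(a, Sxy, b) / math.sqrt(quad(a, Sxx, a) * quad(b, Syy, b))


# Hypothetical covariance blocks; the joint covariance matrix is positive definite.
Sxx = [[1.0, 0.5], [0.5, 1.0]]
Syy = [[1.0, 0.3], [0.3, 1.0]]
Sxy = [[0.4, 0.2], [0.1, 0.3]]
Syx = transpose(Sxy)

# M = Sxx^{-1} Sxy Syy^{-1} Syx; its eigenvalues are the squared canonical correlations.
Syy_inv_Syx = matmul(inv2(Syy), Syx)
M = matmul(matmul(inv2(Sxx), Sxy), Syy_inv_Syx)
trace = M[0][0] + M[1][1]
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
disc = math.sqrt(max(trace * trace - 4.0 * det, 0.0))
lam1, lam2 = (trace + disc) / 2.0, (trace - disc) / 2.0

# Eigenvector of M for lam1 (valid here because M[0][1] != 0),
# and the matching b, proportional to Syy^{-1} Syx a.
a1 = [M[0][1], lam1 - M[0][0]]
b1 = [row[0] * a1[0] + row[1] * a1[1] for row in Syy_inv_Syx]

rho1 = rho(a1, b1, Sxx, Sxy, Syy)
print(math.sqrt(lam1), math.sqrt(lam2), rho1)
```

The vectors `a1` and `b1` are not normalised to satisfy the constraints of Theorem 16.1; by the scale invariance of (16.2) this does not change the correlation, which agrees with the square root of the largest eigenvalue.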

The following theorem shows that canonical correlations are invariant w.r.t. linear
transformations of the original variables.

Theorem 16.3 Let $X^{*} = \mathcal{U}^{\top}X + u$ and $Y^{*} = \mathcal{V}^{\top}Y + v$ where $\mathcal{U}$ and $\mathcal{V}$ are
nonsingular matrices. Then the canonical correlations between $X^{*}$ and $Y^{*}$ are the
same as those between $X$ and $Y$. The canonical correlation vectors of $X^{*}$ and $Y^{*}$
are given by

$$a_i^{*} = \mathcal{U}^{-1}a_i, \qquad b_i^{*} = \mathcal{V}^{-1}b_i. \qquad (16.13)$$


Summary

↪ CCA aims to identify possible links between two (sub-)sets of
variables $X \in \mathbb{R}^q$ and $Y \in \mathbb{R}^p$. The idea is to find indices $a^{\top}X$
and $b^{\top}Y$ such that the correlation $\rho(a,b) = \rho_{a^{\top}X\; b^{\top}Y}$ is maximal.

↪ The maximum correlation (under constraints) is attained by setting
$a_i = \Sigma_{XX}^{-1/2}\gamma_i$ and $b_i = \Sigma_{YY}^{-1/2}\delta_i$, where $\gamma_i$ and $\delta_i$ denote the
eigenvectors of $\mathcal{K}\mathcal{K}^{\top}$ and $\mathcal{K}^{\top}\mathcal{K}$, $\mathcal{K} = \Sigma_{XX}^{-1/2}\Sigma_{XY}\Sigma_{YY}^{-1/2}$, respectively.

↪ The vectors $a_i$ and $b_i$ are called canonical correlation vectors.

↪ The indices $\eta_i = a_i^{\top}X$ and $\varphi_i = b_i^{\top}Y$ are called canonical
correlation variables.

↪ The values $\rho_1 = \sqrt{\lambda_1}, \ldots, \rho_k = \sqrt{\lambda_k}$, which are the square
roots of the nonzero eigenvalues of $\mathcal{K}\mathcal{K}^{\top}$ and $\mathcal{K}^{\top}\mathcal{K}$, are called
the canonical correlation coefficients. The covariance between
the canonical correlation variables is $\mathrm{Cov}(\eta_i, \varphi_i) = \sqrt{\lambda_i}$,
$i = 1, \ldots, k$.

↪ The first canonical variables, $\eta_1 = a_1^{\top}X$ and $\varphi_1 = b_1^{\top}Y$, have the
maximum covariance $\sqrt{\lambda_1}$.

↪ Canonical correlations are invariant w.r.t. linear transformations of
the original variables $X$ and $Y$.


16.2  Canonical  Correlation  in  Practice 

In practice we have to estimate the covariance matrices $\Sigma_{XX}$, $\Sigma_{XY}$ and $\Sigma_{YY}$. Let us
apply the CCA to the car marks data (see Table 22.7). In the context of this data
set one is interested in relating price variables with variables such as sportiness
and safety. In particular, we would like to investigate the relation between the two
variables non-depreciation of value and price of the car and all other variables.

Example 16.1 We perform the CCA on the data matrices $\mathcal{X}$ and $\mathcal{Y}$ that correspond
to the set of values {Price, Value Stability} and {Economy, Service, Design, Sporty
car, Safety, Easy handling}, respectively. The estimated covariance matrix $S$ is
given by

         Price   Value   Econ.   Serv.   Design  Sport.  Safety  Easy h.
      (   1.41   -1.11    0.78   -0.71   -0.90   -1.04   -0.95    0.18  )
      (  -1.11    1.19   -0.42    0.82    0.77    0.90    1.12    0.11  )
      (   0.78   -0.42    0.75   -0.23   -0.45   -0.42   -0.28    0.28  )
S  =  (  -0.71    0.82   -0.23    0.66    0.52    0.57    0.85    0.14  )
      (  -0.90    0.77   -0.45    0.52    0.72    0.77    0.68   -0.10  )
      (  -1.04    0.90   -0.42    0.57    0.77    1.05    0.76   -0.15  )
      (  -0.95    1.12   -0.28    0.85    0.68    0.76    1.26    0.22  )
      (   0.18    0.11    0.28    0.14   -0.10   -0.15    0.22    0.32  )

Hence,

$$S_{XX} = \begin{pmatrix} 1.41 & -1.11 \\ -1.11 & 1.19 \end{pmatrix}, \qquad
S_{XY} = \begin{pmatrix} 0.78 & -0.71 & -0.90 & -1.04 & -0.95 & 0.18 \\ -0.42 & 0.82 & 0.77 & 0.90 & 1.12 & 0.11 \end{pmatrix},$$

$$S_{YY} = \begin{pmatrix}
0.75 & -0.23 & -0.45 & -0.42 & -0.28 & 0.28 \\
-0.23 & 0.66 & 0.52 & 0.57 & 0.85 & 0.14 \\
-0.45 & 0.52 & 0.72 & 0.77 & 0.68 & -0.10 \\
-0.42 & 0.57 & 0.77 & 1.05 & 0.76 & -0.15 \\
-0.28 & 0.85 & 0.68 & 0.76 & 1.26 & 0.22 \\
0.28 & 0.14 & -0.10 & -0.15 & 0.22 & 0.32
\end{pmatrix}.$$

It is interesting to see that value stability and price have a negative covariance. This
makes sense since highly priced vehicles tend to lose their market value at a faster
pace than medium priced vehicles.

Now we estimate $\mathcal{K} = \Sigma_{XX}^{-1/2}\Sigma_{XY}\Sigma_{YY}^{-1/2}$ by

$$\hat{\mathcal{K}} = S_{XX}^{-1/2}\,S_{XY}\,S_{YY}^{-1/2}$$

and perform a singular value decomposition of $\hat{\mathcal{K}}$:

$$\hat{\mathcal{K}} = \mathcal{G}\mathcal{L}\mathcal{D}^{\top} = (g_1, g_2)\; \mathrm{diag}(\ell_1^{1/2}, \ell_2^{1/2})\; (d_1, d_2)^{\top}$$

where the $\ell_i$'s are the eigenvalues of $\hat{\mathcal{K}}\hat{\mathcal{K}}^{\top}$ and $\hat{\mathcal{K}}^{\top}\hat{\mathcal{K}}$ with $\mathrm{rank}(\hat{\mathcal{K}}) = 2$, and $g_i$ and
$d_i$ are the eigenvectors of $\hat{\mathcal{K}}\hat{\mathcal{K}}^{\top}$ and $\hat{\mathcal{K}}^{\top}\hat{\mathcal{K}}$, respectively. The canonical correlation
coefficients are

$$r_1 = \ell_1^{1/2} = 0.98, \qquad r_2 = \ell_2^{1/2} = 0.89.$$

Fig. 16.1 The second canonical variables for the car marks data Q MVAcancarm

The high correlation of the second two canonical variables can be seen in Fig. 16.1.
The second canonical variables are

$$\hat{\eta}_1 = \hat{a}_1^{\top}x = 1.602\,x_1 + 1.686\,x_2$$

$$\hat{\varphi}_1 = \hat{b}_1^{\top}y = 0.568\,y_1 + 0.544\,y_2 - 0.012\,y_3 - 0.096\,y_4 - 0.014\,y_5 + 0.915\,y_6.$$

Note that the variables $y_1$ (economy), $y_2$ (service) and $y_6$ (easy handling) have
positive coefficients on $\hat{\varphi}_1$. The variables $y_3$ (design), $y_4$ (sporty car) and $y_5$ (safety)
have a negative influence on $\hat{\varphi}_1$.

The canonical variable $\eta_1$ may be interpreted as a price and value index. The
canonical variable $\varphi_1$ is mainly formed from the qualitative variables economy,
service and handling with negative weights on design, safety and sportiness. These
variables may therefore be interpreted as an appreciation of the value of the car. The
sportiness has a negative effect on the price and value index, as do the design and
the safety features.
Testing the Canonical Correlation Coefficients


The hypothesis that the two sets of variables $\mathcal{X}$ and $\mathcal{Y}$ are uncorrelated may be tested
(under normality assumptions) with Wilks likelihood ratio statistic (Gibbins, 1985):

$$T^{2/n} = \left| \mathcal{I} - S_{YY}^{-1}\,S_{YX}\,S_{XX}^{-1}\,S_{XY} \right| = \prod_{i=1}^{k} (1 - \ell_i).$$

This statistic unfortunately has a rather complicated distribution. Bartlett (1939)
provides an approximation for large $n$:

$$-\{n - (p + q + 3)/2\}\, \log \prod_{i=1}^{k} (1 - \ell_i) \sim \chi^2_{pq}. \qquad (16.14)$$

A test of the hypothesis that only $s$ of the canonical correlation coefficients are
nonzero may be based (asymptotically) on the statistic

$$-\{n - (p + q + 3)/2\}\, \log \prod_{i=s+1}^{k} (1 - \ell_i) \sim \chi^2_{(p-s)(q-s)}. \qquad (16.15)$$

Example 16.2 Consider Example 16.1 again. There are $n = 40$ persons that have
rated the cars according to different categories with $p = 2$ and $q = 6$. The canonical
correlation coefficients were found to be $r_1 = 0.98$ and $r_2 = 0.89$. Bartlett's
statistic (16.14) is therefore

$$-\{40 - (2 + 6 + 3)/2\}\, \log\{(1 - 0.98^2)(1 - 0.89^2)\} = 165.59 \sim \chi^2_{12}$$

which is highly significant (the 99 % quantile of the $\chi^2_{12}$ is 26.23). The hypothesis
of no correlation between the variables $\mathcal{X}$ and $\mathcal{Y}$ is therefore rejected.

Let us now test whether the second canonical correlation coefficient is different
from zero. We use Bartlett's statistic (16.15) with $s = 1$ and obtain

$$-\{40 - (2 + 6 + 3)/2\}\, \log\{(1 - 0.89^2)\} = 54.19 \sim \chi^2_{5}$$

which is again highly significant with the $\chi^2_5$ distribution.
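
The two statistics of Example 16.2 are easy to reproduce. The sketch below is illustrative Python using only the standard library; the function name `bartlett_statistic` and its argument names are our own, and the critical values are left to a table or a statistics package rather than computed here.

```python
import math


def bartlett_statistic(r, n, p, q, s=0):
    """Bartlett's approximation (16.14), or (16.15) when s > 0.

    r : canonical correlation coefficients, largest first
    s : number of coefficients taken to be nonzero under the null hypothesis
    Returns the statistic and its degrees of freedom (p - s)(q - s).
    """
    log_prod = sum(math.log(1.0 - ri ** 2) for ri in r[s:])
    stat = -(n - (p + q + 3) / 2.0) * log_prod
    return stat, (p - s) * (q - s)


r = [0.98, 0.89]  # canonical correlation coefficients from Example 16.1
stat_all, df_all = bartlett_statistic(r, n=40, p=2, q=6)        # (16.14)
stat_2nd, df_2nd = bartlett_statistic(r, n=40, p=2, q=6, s=1)   # (16.15), s = 1
print(round(stat_all, 2), df_all)
print(round(stat_2nd, 2), df_2nd)
```

Each statistic is then compared with a quantile of the $\chi^2$ distribution with the returned degrees of freedom, as done in the example.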


CCA  with  Qualitative  Data 

The canonical correlation technique may also be applied to qualitative data.
Consider for example the contingency table $\mathcal{N}$ of the French baccalaureat data. The
dataset is given in Table 22.8 in Chap. 22. The CCA cannot be applied directly to
this contingency table since the table does not correspond to the usual data matrix
structure. We may wish, however, to explain the relationship between the row $r$ and
column $c$ categories. It is possible to represent the data in a $(n \times (r + c))$ data matrix
$\mathcal{Z} = (\mathcal{X}, \mathcal{Y})$ where $n$ is the total number of frequencies in the contingency table $\mathcal{N}$
and $\mathcal{X}$ and $\mathcal{Y}$ are matrices of zero-one dummy variables. More precisely, let

$$x_{ki} = \begin{cases} 1 & \text{if the $k$th individual belongs to the $i$th row category} \\ 0 & \text{otherwise} \end{cases}$$

and

$$y_{kj} = \begin{cases} 1 & \text{if the $k$th individual belongs to the $j$th column category} \\ 0 & \text{otherwise} \end{cases}$$

where the indices range from $k = 1, \ldots, n$, $i = 1, \ldots, r$ and $j = 1, \ldots, c$. Denote
the cell frequencies by $n_{ij}$ so that $\mathcal{N} = (n_{ij})$ and note that

$$x_{(i)}^{\top}\, y_{(j)} = n_{ij},$$

where $x_{(i)}$ ($y_{(j)}$) denotes the $i$th ($j$th) column of $\mathcal{X}$ ($\mathcal{Y}$).


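Example 16.3 Consider the following example where

$$ \mathcal{N} = \begin{pmatrix} 3 & 2 \\ 1 & 4 \end{pmatrix}. $$

The matrices $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{Z}$ are therefore

$$ \mathcal{X} = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1 \end{pmatrix}, \quad \mathcal{Y} = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1 \end{pmatrix}, \quad \mathcal{Z} = (\mathcal{X}, \mathcal{Y}) = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 1 \end{pmatrix}. $$

The element $n_{12}$ of $\mathcal{N}$ may be obtained by multiplying the first column of $\mathcal{X}$ with the second column of $\mathcal{Y}$ to yield

$$ x_{(1)}^{\top} y_{(2)} = 2. $$

The purpose is to find the canonical variables $\eta = a^{\top} x$ and $\varphi = b^{\top} y$ that are maximally correlated. Note, however, that $x$ has only one nonzero component and therefore an "individual" may be directly associated with its canonical variables or score $(a_i, b_j)$. There will be $n_{ij}$ points at each $(a_i, b_j)$ and the correlation represented by these points may serve as a measure of dependence between the rows and columns of $\mathcal{N}$.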
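The dummy coding can be reproduced mechanically. The sketch below (Python, with helper names `dummy_matrices` and `cross_product` chosen for illustration) expands the two-by-two table with cell counts 3, 2, 1, 4, which corresponds to the matrix $\mathcal{Z}$ of Example 16.3, and confirms that the columns of the indicator matrices reproduce the cell frequencies.

```python
def dummy_matrices(table):
    """Expand a contingency table into zero-one indicator matrices:
    X codes the row category, Y the column category of each individual."""
    r, c = len(table), len(table[0])
    X, Y = [], []
    for i in range(r):
        for j in range(c):
            for _ in range(table[i][j]):      # one data row per individual
                X.append([1 if a == i else 0 for a in range(r)])
                Y.append([1 if b == j else 0 for b in range(c)])
    return X, Y


def cross_product(X, Y):
    """Compute X^T Y; entry (i, j) is the product of column i of X and column j of Y."""
    return [[sum(x[i] * y[j] for x, y in zip(X, Y))
             for j in range(len(Y[0]))] for i in range(len(X[0]))]


N = [[3, 2], [1, 4]]
X, Y = dummy_matrices(N)
Z = [x + y for x, y in zip(X, Y)]             # Z = (X, Y), one row per individual
print(cross_product(X, Y))
```

The printed cross product equals the original table, so no information is lost by passing from the table to the indicator matrices.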

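Let $\mathcal{Z} = (\mathcal{X}, \mathcal{Y})$ denote a data matrix constructed from a contingency table $\mathcal{N}$. Similar to Chap. 14 define

$$ c = x_{i\bullet} = \sum_{j=1}^{c} n_{ij}, \qquad d = x_{\bullet j} = \sum_{i=1}^{r} n_{ij} $$

and define $\mathcal{C} = \operatorname{diag}(c)$ and $\mathcal{D} = \operatorname{diag}(d)$. Suppose that $x_{i\bullet} > 0$ and $x_{\bullet j} > 0$ for all $i$ and $j$. It is not hard to see that

$$ n\mathcal{S} = \mathcal{Z}^{\top} \mathcal{H} \mathcal{Z} = \mathcal{Z}^{\top} \mathcal{Z} - n \bar{z} \bar{z}^{\top} = \begin{pmatrix} n\mathcal{S}_{XX} & n\mathcal{S}_{XY} \\ n\mathcal{S}_{YX} & n\mathcal{S}_{YY} \end{pmatrix} = \frac{n}{n-1} \begin{pmatrix} \mathcal{C} - n^{-1} c c^{\top} & \mathcal{N} - \hat{\mathcal{N}} \\ \mathcal{N}^{\top} - \hat{\mathcal{N}}^{\top} & \mathcal{D} - n^{-1} d d^{\top} \end{pmatrix} $$

where $\hat{\mathcal{N}} = c d^{\top} / n$ is the estimated value of $\mathcal{N}$ under the assumption of independence of the row and column categories.

Note that

$$ (n - 1) \mathcal{S}_{XX} 1_r = \mathcal{C} 1_r - n^{-1} c c^{\top} 1_r = c - c (n^{-1} c^{\top} 1_r) = c - c (n^{-1} n) = 0 $$

and therefore $\mathcal{S}_{XX}^{-1}$ does not exist. The same is true for $\mathcal{S}_{YY}^{-1}$. One way out of this difficulty is to drop one column from both $\mathcal{X}$ and $\mathcal{Y}$, say the first column. Let $\bar{c}$ and $\bar{d}$ denote the vectors obtained by deleting the first component of $c$ and $d$.

Define $\bar{\mathcal{C}}$, $\bar{\mathcal{D}}$ and $\bar{\mathcal{S}}_{XX}$, $\bar{\mathcal{S}}_{YY}$, $\bar{\mathcal{S}}_{XY}$ accordingly and obtain

$$ (n \bar{\mathcal{S}}_{XX})^{-1} = \bar{\mathcal{C}}^{-1} + n_{i\bullet}^{-1} 1_r 1_r^{\top} $$
$$ (n \bar{\mathcal{S}}_{YY})^{-1} = \bar{\mathcal{D}}^{-1} + n_{\bullet j}^{-1} 1_c 1_c^{\top} $$

so that (16.3) exists. The score associated with an individual contained in the first row (column) category of $\mathcal{N}$ is 0.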
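The singularity of the row block is easy to confirm numerically. The following sketch (Python; the function name `row_block_times_ones` is ours) forms $\mathcal{C} - n^{-1} c c^{\top}$ from the row totals of a contingency table and multiplies it by the vector of ones. It uses the table with cells 3, 2, 1, 4 of Example 16.3 as input.

```python
def row_block_times_ones(table):
    """Return (C - c c^T / n) 1_r, where c holds the row totals of the table
    and C = diag(c). A zero vector means the block is singular."""
    c = [sum(row) for row in table]
    n = sum(c)
    r = len(c)
    M = [[(c[i] if i == j else 0) - c[i] * c[j] / n for j in range(r)]
         for i in range(r)]
    # Multiplying by the vector of ones amounts to taking row sums.
    return [sum(M[i]) for i in range(r)]


N = [[3, 2], [1, 4]]
result = row_block_times_ones(N)
print(result)
```

Every entry of the result vanishes, which is the reason why one dummy column has to be dropped before the inverse can be formed.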

The  technique  described  here  for  purely  qualitative  data  may  also  be  used  when 
the  data  is  a  mixture  of  qualitative  and  quantitative  characteristics.  One  has  to  “blow 
up”  the  data  matrix  by  dummy  zero-one  values  for  the  qualitative  data  variables. 




>  Summary 

- In practice we estimate $\Sigma_{XX}$, $\Sigma_{XY}$, $\Sigma_{YY}$ by the empirical covariances and use them to compute estimates $\ell_i$, $g_i$, $d_i$ for $\lambda_i$, $\gamma_i$, $\delta_i$ from the SVD of $\hat{\mathcal{K}} = \mathcal{S}_{XX}^{-1/2} \mathcal{S}_{XY} \mathcal{S}_{YY}^{-1/2}$.

- The signs of the coefficients of the canonical variables tell us the direction of the influence of these variables.


16.3  Exercises 

Exercise 16.1 Show that the eigenvalues of $\mathcal{K}\mathcal{K}^{\top}$ and $\mathcal{K}^{\top}\mathcal{K}$ are identical. (Hint: Use Theorem 2.6.)

Exercise 16.2 Perform the CCA for the following subsets of variables: $X$ corresponding to {price} and $Y$ corresponding to {economy, easy handling} from the car marks data (Table 22.7).

Exercise 16.3 Calculate the first canonical variables for Example 16.1. Interpret the coefficients.

Exercise 16.4 Use the SVD of matrix $\mathcal{K}$ to show that the canonical variables $\eta_1$ and $\eta_2$ are not correlated.

Exercise 16.5 Verify that the number of nonzero eigenvalues of matrix $\mathcal{K}$ is equal to $\operatorname{rank}(\Sigma_{XY})$.

Exercise 16.6 Express the singular value decomposition of matrices $\mathcal{K}$ and $\mathcal{K}^{\top}$ using eigenvalues and eigenvectors of matrices $\mathcal{K}^{\top}\mathcal{K}$ and $\mathcal{K}\mathcal{K}^{\top}$.

Exercise 16.7 What will be the result of CCA for $Y = X$?

Exercise 16.8 What will be the results of CCA for $Y = 2X$ and for $Y = -X$?

Exercise 16.9 What results do you expect if you perform CCA for $X$ and $Y$ such that $\Sigma_{XY} = 0$? What if $\Sigma_{XY} = \mathcal{I}_p$?


Chapter  17 

Multidimensional  Scaling 


One  major  aim  of  multivariate  data  analysis  is  dimension  reduction.  For  data 
measured  in  Euclidean  coordinates,  Factor  Analysis  and  Principal  Component 
Analysis  are  dominantly  used  tools.  In  many  applied  sciences  data  is  recorded  as 
ranked  information.  For  example,  in  marketing,  one  may  record  “product  A  is  better 
than  product  B”.  High-dimensional  observations  therefore  often  have  mixed  data 
characteristics  and  contain  relative  information  (w.r.t.  a  defined  standard)  rather 
than  absolute  coordinates  that  would  enable  us  to  employ  one  of  the  multivariate 
techniques  presented  so  far. 

Multidimensional  scaling  (MDS)  is  a  method  based  on  proximities  between 
objects,  subjects,  or  stimuli  used  to  produce  a  spatial  representation  of  these 
items.  Proximities  express  the  similarity  or  dissimilarity  between  data  objects.  It 
is  a  dimension  reduction  technique  since  the  aim  is  to  find  a  set  of  points  in 
low  dimension  (typically  two  dimensions)  that  reflect  the  relative  configuration 
of  the  high-dimensional  data  objects.  The  metric  MDS  is  concerned  with  such  a 
representation  in  Euclidean  coordinates.  The  desired  projections  are  found  via  an 
appropriate  spectral  decomposition  of  a  distance  matrix. 

The metric MDS solution may result in projections of data objects that conflict with the ranking of the original observations. Nonmetric MDS solves this problem by iterating between a monotonising algorithmic step and a least squares projection step. The examples presented in this chapter are based on reconstructing a map from a distance matrix and on marketing concerns such as ranking the outfit of cars.


17.1  The  Problem 

MDS is a mathematical tool that uses proximities between objects, subjects or stimuli to produce a spatial representation of these items. The proximities are defined as any set of numbers that express the amount of similarity or dissimilarity between pairs of objects, subjects or stimuli. In contrast to the techniques considered so far, MDS does not start from the raw multivariate data matrix $\mathcal{X}$, but from a $(n \times n)$ dissimilarity or distance matrix, $\mathcal{D}$, with the elements $\delta_{ij}$ and $d_{ij}$ respectively. Hence, the underlying dimensionality of the data under investigation is in general not known.

MDS is a data reduction technique because it is concerned with the problem of finding a set of points in low dimension that represents the "configuration" of data in high dimension. The "configuration" in high dimension is represented by the distance or dissimilarity matrix $\mathcal{D}$.

MDS techniques are often used to understand how people perceive and evaluate certain signals and information. For instance, political scientists use MDS techniques to understand why political candidates are perceived by voters as being similar or dissimilar. Psychologists use MDS to understand the perceptions and evaluations of speech, colours and personality traits, among other things. Last but not least, in marketing, researchers use MDS techniques to shed light on the way consumers evaluate brands and to assess the relationship between product attributes.

In  short,  the  primary  purpose  of  all  MDS-techniques  is  to  uncover  structural 
relations  or  patterns  in  the  data  and  to  represent  it  in  a  simple  geometrical  model 
or  picture.  One  of  the  aims  is  to  determine  the  dimension  of  the  model  (the  goal  is  a 
low-dimensional,  easily  interpretable  model)  by  finding  the  d  -dimensional  space  in 
which  there  is  maximum  correspondence  between  the  observed  proximities  and  the 
distances  between  points  measured  on  a  metric  scale. 

MDS  based  on  proximities  is  usually  referred  to  as  metric  MDS,  whereas  the 
more  popular  nonmetric  MDS  is  used  when  the  proximities  are  measured  on  an 
ordinal  scale. 

Example 17.1 A good example of how MDS works is given by Dillon and Goldstein (1984) (page 108). Suppose one is confronted with a map of Germany and asked to measure, with the use of a ruler and the scale of the map, some inter-city distances. Admittedly this is quite an easy exercise. However, let us now reverse the problem: One is given a set of distances, as in Table 17.1, and is asked to recreate the map itself. This is a far more difficult exercise, though it can be solved with a ruler and a compass in two dimensions. MDS is a method for solving this reverse problem in arbitrary dimensions. Figure 17.1 shows the metric MDS solution for Table 17.1, and Fig. 17.2 shows the same solution after reflecting and rotating the points representing the cities. Note that the distances given in Table 17.1 are road distances that in general do not correspond to Euclidean distances. In real-life applications, the problems are considerably more complex: there are usually errors in the data and the dimensionality is rarely known in advance.

Table 17.1 Inter-city distances

           Berlin  Dresden  Hamburg  Koblenz  Munich  Rostock
Berlin          0      214      279      610     596      237
Dresden                  0      492      533     496      444
Hamburg                           0      520     772      140
Koblenz                                    0     521      687
Munich                                             0      771
Rostock                                                     0

Fig. 17.1 Metric MDS solution for the inter-city road distances Q

Fig. 17.2 Metric MDS solution for the inter-city road distances after reflection and 90° rotation Q MVAMDScity2

Table 17.2 Dissimilarities for cars

             Audi 100   BMW 5   Citroen AX   Ferrari   ...
Audi 100        0       2.232     3.451       3.689    ...
BMW 5         2.232       0       5.513       3.167    ...
Citroen AX    3.451     5.513       0         6.202    ...
Ferrari       3.689     3.167     6.202         0      ...
...            ...       ...       ...         ...     ...

Fig. 17.3 MDS solution on the car data (panel title: Metric MDS; horizontal axis: x) Q MVAmdscarm

Example 17.2 A further example is given in Table 17.2 where consumers noted their impressions of the dissimilarity of certain cars. The dissimilarities in this table were in fact computed from Table 22.7 as Euclidean distances

$$ d_{ij} = \sqrt{\sum_{l=1}^{8} (x_{il} - x_{jl})^2}. $$

MDS produces Fig. 17.3 which shows a non-linear relationship for all the cars in the projection. This enables us to build a non-linear (quadratic) index with the Wartburg and the Trabant on the left and the Ferrari and the Jaguar on the right. We can construct an order or ranking of the cars based on the subjective impression of the consumers.

Fig. 17.4 Correlation between the MDS direction and the variables (panel title: Correlations MDS/Variables) Q MVAmdscarm

What does the ranking describe? The answer is given by Fig. 17.4 which shows the correlation between the MDS projection and the variables. Apparently, the first MDS direction is highly correlated with service (-), value (-), design (-), sportiness (-), safety (-) and price (+). We can interpret the first direction as the price direction since a bad mark in price ("high price") obviously corresponds with a good mark, say, in sportiness ("very sportive"). The second MDS direction is highly positively correlated with practicability. We observe from this data an almost orthogonal relationship between price and practicability.

In MDS a map is constructed in Euclidean space that corresponds to given distances. Which solution can we expect? The solution is determined only up to rotation, reflection and shifts. In general, if $P_1, \dots, P_n$ with coordinates $x_i = (x_{i1}, \dots, x_{ip})^{\top}$ for $i = 1, \dots, n$ represents a MDS solution in $p$ dimensions, then $y_i = \mathcal{A} x_i + b$ with an orthogonal matrix $\mathcal{A}$ and a shift vector $b$ also represents a MDS solution. A comparison of Figs. 17.1 and 17.2 illustrates this fact.
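This indeterminacy can be verified directly: the inter-point distances do not change under $y_i = \mathcal{A} x_i + b$. The sketch below (Python, two dimensions only) uses an arbitrary illustrative configuration of four points and checks a 90° rotation with a shift as well as a reflection. The helper names are our own.

```python
import math


def transform(points, A, b):
    """Apply y_i = A x_i + b to every point x_i (two dimensions)."""
    return [(A[0][0] * x + A[0][1] * y + b[0],
             A[1][0] * x + A[1][1] * y + b[1]) for x, y in points]


def distance_matrix(points):
    """Matrix of Euclidean distances between all pairs of points."""
    return [[math.dist(p, q) for q in points] for p in points]


def max_abs_difference(D1, D2):
    """Largest entrywise gap between two matrices of the same shape."""
    return max(abs(a - b) for r1, r2 in zip(D1, D2) for a, b in zip(r1, r2))


points = [(3, 2), (2, 7), (1, 3), (10, 4)]    # arbitrary illustrative configuration
theta = math.pi / 2                            # rotation by 90 degrees
rotation = [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]
reflection = [[1, 0], [0, -1]]                 # orthogonal with determinant -1

D = distance_matrix(points)
gap_rotation = max_abs_difference(D, distance_matrix(transform(points, rotation, (5.0, -1.0))))
gap_reflection = max_abs_difference(D, distance_matrix(transform(points, reflection, (0.0, 0.0))))
print(gap_rotation, gap_reflection)
```

Both gaps are zero up to floating point rounding, so each transformed configuration is an equally valid MDS solution.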

Solution methods that use only the rank order of the distances are termed nonmetric methods of MDS. Methods aimed at finding the points $P_i$ directly from a distance matrix like the one in Table 17.2 are called metric methods.

Summary

- MDS is a set of techniques which use distances or dissimilarities to project high-dimensional data into a low-dimensional space essential in understanding respondents' perceptions and evaluations for all sorts of items.

- MDS starts with a $(n \times n)$ proximity matrix $\mathcal{D}$ consisting of dissimilarities $\delta_{ij}$ or distances $d_{ij}$.

- MDS is an explorative technique and focuses on data reduction.

- The MDS solution is indeterminate with respect to rotation, reflection and shifts.

- The MDS techniques are divided into metric MDS and nonmetric MDS.


17.2  Metric  MDS 

Metric MDS begins with a $(n \times n)$ distance matrix $\mathcal{D}$ with elements $d_{ij}$ where $i, j = 1, \dots, n$. The objective of metric MDS is to find a configuration of points in $p$-dimensional space from the distances between the points such that the coordinates of the $n$ points along the $p$ dimensions yield a Euclidean distance matrix whose elements are as close as possible to the elements of the given distance matrix $\mathcal{D}$.


The Classical Solution

The classical solution is based on a distance matrix that is computed from a Euclidean geometry.

Definition 17.1 A $(n \times n)$ distance matrix $\mathcal{D} = (d_{ij})$ is Euclidean if for some points $x_1, \dots, x_n \in \mathbb{R}^p$; $d_{ij}^2 = (x_i - x_j)^{\top} (x_i - x_j)$.

The following result tells us whether a distance matrix is Euclidean or not.

Theorem 17.1 Define $\mathcal{A} = (a_{ij})$, $a_{ij} = -\frac{1}{2} d_{ij}^2$ and $\mathcal{B} = \mathcal{H} \mathcal{A} \mathcal{H}$ where $\mathcal{H}$ is the centering matrix. $\mathcal{D}$ is Euclidean if and only if $\mathcal{B}$ is positive semidefinite. If $\mathcal{D}$ is the distance matrix of a data matrix $\mathcal{X}$, then $\mathcal{B} = \mathcal{H} \mathcal{X} \mathcal{X}^{\top} \mathcal{H}$. $\mathcal{B}$ is called the inner product matrix.


Recovery of Coordinates

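The task of MDS is to find the original Euclidean coordinates from a given distance matrix. Let the coordinates of $n$ points in a $p$ dimensional Euclidean space be given by $x_i$ ($i = 1, \dots, n$) where $x_i = (x_{i1}, \dots, x_{ip})^{\top}$. Call $\mathcal{X} = (x_1, \dots, x_n)^{\top}$ the coordinate matrix and assume $\bar{x} = 0$. The Euclidean distance between the $i$th and $j$th points is given by:

$$ d_{ij}^2 = \sum_{k=1}^{p} (x_{ik} - x_{jk})^2. \tag{17.1} $$

The general $b_{ij}$ term of $\mathcal{B}$ is given by:

$$ b_{ij} = \sum_{k=1}^{p} x_{ik} x_{jk} = x_i^{\top} x_j. \tag{17.2} $$

It is possible to derive $\mathcal{B}$ from the known squared distances $d_{ij}^2$, and then from $\mathcal{B}$ the unknown coordinates.

$$ d_{ij}^2 = x_i^{\top} x_i + x_j^{\top} x_j - 2 x_i^{\top} x_j = b_{ii} + b_{jj} - 2 b_{ij}. \tag{17.3} $$

Centering of the coordinate matrix $\mathcal{X}$ implies that $\sum_{i=1}^{n} b_{ij} = 0$. Summing (17.3) over $i$ and $j$, we find:

$$ \frac{1}{n} \sum_{i=1}^{n} d_{ij}^2 = \frac{1}{n} \sum_{i=1}^{n} b_{ii} + b_{jj} $$
$$ \frac{1}{n} \sum_{j=1}^{n} d_{ij}^2 = b_{ii} + \frac{1}{n} \sum_{j=1}^{n} b_{jj} $$
$$ \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} d_{ij}^2 = \frac{2}{n} \sum_{i=1}^{n} b_{ii}. \tag{17.4} $$

Solving (17.3) and (17.4) gives:

$$ b_{ij} = -\frac{1}{2} \left( d_{ij}^2 - d_{i\bullet}^2 - d_{\bullet j}^2 + d_{\bullet\bullet}^2 \right), \tag{17.5} $$

where $d_{i\bullet}^2$, $d_{\bullet j}^2$ and $d_{\bullet\bullet}^2$ denote the row, column and overall means of the squared distances appearing on the left-hand sides of (17.4).

With $a_{ij} = -\frac{1}{2} d_{ij}^2$, and

$$ a_{i\bullet} = \frac{1}{n} \sum_{j=1}^{n} a_{ij} $$
$$ a_{\bullet j} = \frac{1}{n} \sum_{i=1}^{n} a_{ij} $$
$$ a_{\bullet\bullet} = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} \tag{17.6} $$

we get:

$$ b_{ij} = a_{ij} - a_{i\bullet} - a_{\bullet j} + a_{\bullet\bullet}. \tag{17.7} $$

Define the matrix $\mathcal{A}$ as $(a_{ij})$, and observe that:

$$ \mathcal{B} = \mathcal{H} \mathcal{A} \mathcal{H}. \tag{17.8} $$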
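The step from (17.3) and (17.4) to (17.5) can be spelled out as follows. The display is a sketch of the algebra only; the abbreviation $T = \sum_{i} b_{ii}$ and the mean notation $d_{i\bullet}^2$, $d_{\bullet j}^2$, $d_{\bullet\bullet}^2$ for the row, column and overall means of the squared distances are introduced here for convenience.

```latex
% Assumed shorthand: T = \sum_{i=1}^n b_{ii}; the cross terms vanish
% because centering gives \sum_i b_{ij} = \sum_j b_{ij} = 0.
\begin{align*}
d_{\bullet j}^2 &= \frac{1}{n}\sum_{i=1}^{n} d_{ij}^2 = \frac{T}{n} + b_{jj}, \\
d_{i\bullet}^2 &= \frac{1}{n}\sum_{j=1}^{n} d_{ij}^2 = b_{ii} + \frac{T}{n}, \\
d_{\bullet\bullet}^2 &= \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} d_{ij}^2 = \frac{2T}{n}, \\
d_{i\bullet}^2 + d_{\bullet j}^2 - d_{\bullet\bullet}^2 &= b_{ii} + b_{jj}, \\
d_{ij}^2 = b_{ii} + b_{jj} - 2 b_{ij}
\;&\Longrightarrow\;
b_{ij} = -\tfrac{1}{2}\left(d_{ij}^2 - d_{i\bullet}^2 - d_{\bullet j}^2 + d_{\bullet\bullet}^2\right).
\end{align*}
```

Replacing every squared distance by $-2 a_{ij}$ then turns the last line into the double-centred form $a_{ij} - a_{i\bullet} - a_{\bullet j} + a_{\bullet\bullet}$.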

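The inner product matrix $\mathcal{B}$ can be expressed as:

$$ \mathcal{B} = \mathcal{X} \mathcal{X}^{\top}, \tag{17.9} $$

where $\mathcal{X} = (x_1, \dots, x_n)^{\top}$ is the $(n \times p)$ matrix of coordinates. The rank of $\mathcal{B}$ is then

$$ \operatorname{rank}(\mathcal{B}) = \operatorname{rank}(\mathcal{X} \mathcal{X}^{\top}) = \operatorname{rank}(\mathcal{X}) = p. \tag{17.10} $$

As required in Theorem 17.1 the matrix $\mathcal{B}$ is symmetric, positive semidefinite and of rank $p$, and hence it has $p$ non-negative eigenvalues and $n - p$ zero eigenvalues. $\mathcal{B}$ can now be written as:

$$ \mathcal{B} = \Gamma \Lambda \Gamma^{\top} \tag{17.11} $$

where $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_p)$, the diagonal matrix of the eigenvalues of $\mathcal{B}$, and $\Gamma = (\gamma_1, \dots, \gamma_p)$, the matrix of corresponding eigenvectors. Hence the coordinate matrix $\mathcal{X}$ containing the point configuration in $\mathbb{R}^p$ is given by:

$$ \mathcal{X} = \Gamma \Lambda^{1/2}. \tag{17.12} $$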

How Many Dimensions?


The number of desired dimensions is small in order to provide practical interpretations, and is given by the rank of $\mathcal{B}$ or the number of nonzero eigenvalues $\lambda_i$. If $\mathcal{B}$ is positive semidefinite, then the number of nonzero eigenvalues gives the number of eigenvalues required for representing the distances $d_{ij}$.

The proportion of variation explained by $p$ dimensions is given by

$$ \frac{\sum_{i=1}^{p} \lambda_i}{\sum_{i=1}^{n-1} \lambda_i}. \tag{17.13} $$

It can be used for the choice of $p$. If $\mathcal{B}$ is not positive semidefinite we can modify (17.13) to

$$ \frac{\sum_{i=1}^{p} \lambda_i}{\sum (\text{"positive eigenvalues"})}. \tag{17.14} $$

In practice the eigenvalues $\lambda_i$ are almost always unequal to zero. To be able to represent the objects in a space with dimensions as small as possible we may modify the distance matrix to:

$$ \mathcal{D}^{*} = (d_{ij}^{*}) \tag{17.15} $$

with

$$ d_{ij}^{*} = \begin{cases} 0 & ;\ i = j \\ d_{ij} + e \ge 0 & ;\ i \ne j \end{cases} \tag{17.16} $$

where $e$ is determined such that the inner product matrix $\mathcal{B}$ becomes positive semidefinite with a small rank.
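As a small illustration of (17.13) and (17.14), the following Python sketch computes the explained proportion from a list of eigenvalues. The function name `explained_proportion` and the two eigenvalue lists are hypothetical and serve only as an example. Dividing by the sum of the positive eigenvalues covers both formulas, because the two coincide when no eigenvalue is negative.

```python
def explained_proportion(eigenvalues, p):
    """Share of variation carried by the p largest eigenvalues,
    relative to the sum of all positive eigenvalues."""
    lam = sorted(eigenvalues, reverse=True)
    positive_total = sum(v for v in lam if v > 0)
    return sum(lam[:p]) / positive_total


# Hypothetical spectra: the first is positive semidefinite, the second is not.
share_psd = explained_proportion([16.0, 9.0, 0.0, 0.0], p=1)
share_indefinite = explained_proportion([10.0, 5.0, 1.0, -1.0], p=2)
print(share_psd, share_indefinite)
```

In the second case the negative eigenvalue is left out of the denominator, as prescribed by (17.14).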


Similarities 

In some situations we do not start with distances but with similarities. The standard transformation (see Chap. 13) from a similarity matrix $\mathcal{C}$ to a distance matrix $\mathcal{D}$ is:

$$ d_{ij} = (c_{ii} - 2 c_{ij} + c_{jj})^{1/2}. \tag{17.17} $$
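A minimal sketch of this transformation in Python is given below. As a check we take the similarity matrix to be the matrix of inner products of three illustrative points, in which case (17.17) must return their ordinary Euclidean distances. The function name and the example points are our own assumptions.

```python
import math


def similarity_to_distance(C):
    """Apply d_ij = (c_ii - 2 c_ij + c_jj)^(1/2) to a similarity matrix C."""
    n = len(C)
    # max(..., 0.0) only guards against tiny negative values from rounding.
    return [[math.sqrt(max(C[i][i] - 2 * C[i][j] + C[j][j], 0.0))
             for j in range(n)] for i in range(n)]


points = [(1, 0), (0, 1), (1, 1)]              # illustrative points
C = [[sum(a * b for a, b in zip(p, q)) for q in points] for p in points]
D = similarity_to_distance(C)
print(D)
```

The resulting matrix has a zero diagonal and reproduces the distances $\sqrt{2}$, 1 and 1 between the three points.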

Theorem 17.2 If $\mathcal{C} \ge 0$, then the distance matrix $\mathcal{D}$ defined by (17.17) is Euclidean with centred inner product matrix $\mathcal{B} = \mathcal{H} \mathcal{C} \mathcal{H}$.

Relation to Factorial Analysis

Suppose that the $(n \times p)$ data matrix $\mathcal{X}$ is centred so that $\mathcal{X}^{\top} \mathcal{X}$ equals a multiple of the covariance matrix $n\mathcal{S}$. Suppose that the $p$ eigenvalues $\lambda_1, \dots, \lambda_p$ of $n\mathcal{S}$ are distinct and non zero. Using the duality Theorem 10.4 of factorial analysis we see that $\lambda_1, \dots, \lambda_p$ are also eigenvalues of $\mathcal{X} \mathcal{X}^{\top} = \mathcal{B}$ when $\mathcal{D}$ is the Euclidean distance matrix between the rows of $\mathcal{X}$. The $k$-dimensional solution to the metric MDS problem is thus given by the $k$ first principal components of $\mathcal{X}$.


Optimality  Properties  of  the  Classical  MDS  Solution 

Let $\mathcal{X}$ be a $(n \times p)$ data matrix with some inter-point distance matrix $\mathcal{D}$. The objective of MDS is thus to find $\mathcal{X}_1$, a representation of $\mathcal{X}$ in a lower dimensional Euclidean space $\mathbb{R}^k$ whose inter-point distance matrix $\mathcal{D}_1$ is not far from $\mathcal{D}$. Let $\mathcal{C} = (\mathcal{C}_1, \mathcal{C}_2)$ be a $(p \times p)$ orthogonal matrix where $\mathcal{C}_1$ is $(p \times k)$. $\mathcal{X}_1 = \mathcal{X} \mathcal{C}_1$ represents a projection of $\mathcal{X}$ on the column space of $\mathcal{C}_1$; in other words, $\mathcal{X}_1$ may be viewed as a fitted configuration of $\mathcal{X}$ in $\mathbb{R}^k$. A measure of discrepancy between $\mathcal{D}$ and $\mathcal{D}_1 = (d_{ij}^{(1)})$ is given by

$$ \phi = \sum_{i,j=1}^{n} (d_{ij} - d_{ij}^{(1)})^2. \tag{17.18} $$

Theorem 17.3 Among all projections $\mathcal{X} \mathcal{C}_1$ of $\mathcal{X}$ onto $k$-dimensional subspaces of $\mathbb{R}^p$ the quantity $\phi$ in (17.18) is minimised when $\mathcal{X}$ is projected onto its first $k$ principal factors.

We see therefore that the metric MDS is identical to principal factor analysis as we have defined it in Chap. 10.


Summary

- Metric MDS starts with a distance matrix $\mathcal{D}$.

- The aim of metric MDS is to construct a map in Euclidean space that corresponds to the given distances.

- A practical algorithm is given as:

  1. start with distances $d_{ij}$
  2. define $\mathcal{A} = -\frac{1}{2} d_{ij}^2$
  3. put $\mathcal{B} = (a_{ij} - a_{i\bullet} - a_{\bullet j} + a_{\bullet\bullet})$
  4. find the eigenvalues $\lambda_1, \dots, \lambda_p$ and the associated eigenvectors $\gamma_1, \dots, \gamma_p$ where the eigenvectors are normalised so that $\gamma_i^{\top} \gamma_i = 1$.
  5. Choose an appropriate number of dimensions $p$ (ideally $p = 2$)
  6. The coordinates of the $n$ points in the Euclidean space are given by $x_{ij} = \gamma_{ij} \lambda_j^{1/2}$ for $i = 1, \dots, n$ and $j = 1, \dots, p$.

- Metric MDS is identical to principal components analysis.
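The six steps of the practical algorithm can be sketched in a few lines of Python. To keep the example free of external libraries, the eigenvalues in step 4 are obtained by power iteration with deflation. This is our own substitute for a proper eigen-solver and is adequate only for small illustrative problems with well separated eigenvalues. The input is the distance matrix of the corners of a hypothetical 3 by 4 rectangle, and all function names are assumptions made for this illustration.

```python
import math


def double_center(D):
    """Steps 2-3: a_ij = -d_ij^2 / 2, then b_ij = a_ij - a_i. - a_.j + a_.."""
    n = len(D)
    A = [[-0.5 * D[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(A[i]) / n for i in range(n)]
    col_mean = [sum(A[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    return [[A[i][j] - row_mean[i] - col_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]


def leading_eigenpairs(B, k, iterations=500):
    """Step 4 by power iteration with deflation (illustrative only)."""
    n = len(B)
    B = [row[:] for row in B]                  # work on a copy
    pairs = []
    for _ in range(k):
        v = [1.0 + i * i for i in range(n)]    # arbitrary deterministic start
        for _ in range(iterations):
            w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]          # normalised so that v'v = 1
        lam = sum(v[i] * B[i][j] * v[j] for i in range(n) for j in range(n))
        pairs.append((lam, v))
        for i in range(n):                     # deflate the component just found
            for j in range(n):
                B[i][j] -= lam * v[i] * v[j]
    return pairs


def classical_mds(D, k=2):
    """Steps 5-6: coordinates x_ij = gamma_ij * sqrt(lambda_j)."""
    pairs = leading_eigenpairs(double_center(D), k)
    return [[v[i] * math.sqrt(max(lam, 0.0)) for lam, v in pairs]
            for i in range(len(D))]


# Step 1: distances between the corners of a 3 x 4 rectangle (hypothetical data).
D = [[0, 3, 5, 4],
     [3, 0, 4, 5],
     [5, 4, 0, 3],
     [4, 5, 3, 0]]
coords = classical_mds(D, k=2)
recovered = [[math.dist(a, b) for b in coords] for a in coords]
print([[round(d, 6) for d in row] for row in recovered])
```

Since the input distances are Euclidean in two dimensions, the distances between the recovered points agree with the input matrix. The recovered coordinates themselves are determined only up to rotation, reflection and shift.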


17.3  Nonmetric  MDS 

The object of nonmetric MDS, as well as of metric MDS, is to find the coordinates of the points in $p^{*}$-dimensional space, so that there is a good agreement between the observed proximities and the inter-point distances. The development of nonmetric MDS was motivated by two main weaknesses in the metric MDS (Fahrmeir & Hamerle, 1984, p. 679):

1. the definition of an explicit functional connection between dissimilarities and distances in order to derive distances out of given dissimilarities, and

2. the restriction to Euclidean geometry in order to determine the object configurations.

The idea of a nonmetric MDS is to demand a less rigid relationship between the dissimilarities and the distances. Suppose that an unknown monotonic increasing function $f$,

$$ d_{ij} = f(\delta_{ij}), \tag{17.19} $$

is used to generate a set of distances $d_{ij}$ as a function of given dissimilarities $\delta_{ij}$. Here $f$ has the property that if $\delta_{ij} < \delta_{rs}$, then $f(\delta_{ij}) < f(\delta_{rs})$. The scaling is based on the rank order of the dissimilarities. Nonmetric MDS is therefore ordinal in character.

The most common approach used to determine the elements $d_{ij}$ and to obtain the coordinates of the objects $x_1, x_2, \dots, x_n$ given only rank order information is an iterative process commonly referred to as the Shepard-Kruskal algorithm.

Shepard-Kruskal Algorithm

In a first step, called the initial phase, we calculate Euclidean distances $d_{ij}^{(0)}$ from an arbitrarily chosen initial configuration $\mathcal{X}_0$ in dimension $p^{*}$, provided that all objects have different coordinates. One might use metric MDS to obtain these initial coordinates. The second step or nonmetric phase determines disparities $\hat{d}_{ij}^{(0)}$ from the distances $d_{ij}^{(0)}$ by constructing a monotone regression relationship between the $d_{ij}^{(0)}$'s and $\delta_{ij}$'s, under the requirement that if $\delta_{ij} < \delta_{rs}$, then $\hat{d}_{ij}^{(0)} \le \hat{d}_{rs}^{(0)}$. This is called the weak monotonicity requirement. To obtain the disparities $\hat{d}_{ij}^{(0)}$, a useful approximation method is the pool-adjacent violators (PAV) algorithm (see Fig. 17.5). Let

$$ (i_1, j_1) > (i_2, j_2) > \dots > (i_k, j_k) \tag{17.20} $$

be the rank order of dissimilarities of the $k = n(n - 1)/2$ pairs of objects. This corresponds to the points in Fig. 17.6. The PAV algorithm is described as follows: "beginning with the lowest ranked value of $\delta_{ij}$, the adjacent $d_{ij}^{(0)}$ values are compared for each $\delta_{ij}$ to determine if they are monotonically related to the $\delta_{ij}$'s. Whenever a block of consecutive values of $d_{ij}^{(0)}$ are encountered that violate the required monotonicity property the $d_{ij}^{(0)}$ values are averaged together with the most recent non-violator $d_{ij}^{(0)}$ value to obtain an estimator. Eventually this value is assigned to all points in the particular block".


Fig. 17.5 Pool-adjacent violators algorithm (panel title: Pool-Adjacent-Violator-Algorithm; horizontal axis: Rank) Q MVAMDSpooladj

Fig. 17.6 Ranks and distances (panel title: Monotonic Regression) Q MVAMDSnonmstart


Table 17.3 Dissimilarities $\delta_{ij}$ for car marks

         j    1          2        3         4
i             Mercedes   Jaguar   Ferrari   VW
1  Mercedes   -
2  Jaguar     3          -
3  Ferrari    2          1        -
4  VW         5          4        6         -

In a third step, called the metric phase, the spatial configuration of $\mathcal{X}_0$ is altered to obtain $\mathcal{X}_1$. From $\mathcal{X}_1$ the new distances $d_{ij}^{(1)}$ can be obtained which are more closely related to the disparities $\hat{d}_{ij}^{(0)}$ from step two.

Example 17.3 Consider a small example with 4 objects based on the car marks data set, see Table 17.3. Our aim is to find a representation with $p^{*} = 2$ via MDS. Suppose that we choose as an initial configuration (Fig. 17.7) of $\mathcal{X}_0$ the coordinates given in Table 17.4. The corresponding distances $d_{ij} = \sqrt{(x_i - x_j)^{\top} (x_i - x_j)}$ are calculated in Table 17.5.

Table 17.4 Initial coordinates for MDS

i               x_{i1}   x_{i2}
1  Mercedes      3        2
2  Jaguar        2        7
3  Ferrari       1        3
4  VW           10        4

Fig. 17.7 Initial configuration of the MDS of the car data (panel title: Initial Configuration) Q MVAnmdscar1

Table 17.5 Ranks and distances

i,j    d_{ij}   rank(d_{ij})   \delta_{ij}
1,2    5.1      3              3
1,3    2.2      1              2
1,4    7.3      4              5
2,3    4.1      2              1
2,4    8.5      5              4
3,4    9.1      6              6

A plot of the dissimilarities of Table 17.5 against the distance yields Fig. 17.8. This relation is not satisfactory since the ranking of the $\delta_{ij}$ did not result in a monotone relation of the corresponding distances $d_{ij}$. We apply therefore the PAV algorithm.

The first violator of monotonicity is the second point (1,3). Therefore we average the distances $d_{13}$ and $d_{23}$ to obtain the disparities

$$ \hat{d}_{13} = \hat{d}_{23} = \frac{d_{13} + d_{23}}{2} = \frac{2.2 + 4.1}{2} = 3.15. $$

Applying the same procedure to (2,4) and (1,4) we obtain $\hat{d}_{24} = \hat{d}_{14} = 7.9$. The plot of $\delta_{ij}$ versus the disparities $\hat{d}_{ij}$ represents a monotone regression relationship.
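The pooling step of this example can be reproduced with a short routine. The following Python sketch implements pool-adjacent violators in its usual block form (a violating value is merged with the preceding block and the block mean is assigned to all its members). The function name is ours. The input lists the distances of Table 17.5 ordered by the rank of the dissimilarities, that is for the pairs (2,3), (1,3), (1,2), (2,4), (1,4), (3,4).

```python
def pool_adjacent_violators(values):
    """Non-decreasing least squares fit to a sequence by pooling
    adjacent blocks whose means violate monotonicity."""
    blocks = []                                # each block is [sum, count]
    for v in values:
        blocks.append([v, 1])
        # Merge while the previous block mean exceeds the current block mean.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)  # block mean for every member
    return fitted


# Distances ordered by the rank of the dissimilarities (delta = 1, ..., 6).
distances_by_rank = [4.1, 2.2, 5.1, 8.5, 7.3, 9.1]
disparities = pool_adjacent_violators(distances_by_rank)
print([round(d, 2) for d in disparities])
```

The two pooled blocks are exactly the pairs averaged in the example, and the fitted sequence is non-decreasing as required by the weak monotonicity requirement.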

In the initial configuration (Fig. 17.7), the third point (Ferrari) could be moved so that the distance to object 2 (Jaguar) is reduced. This procedure however also alters the distance between objects 3 and 4. Care should be given when establishing a monotone relation between $\delta_{ij}$ and $d_{ij}$.


In order to assess how well the derived configuration fits the given dissimilarities Kruskal suggests a measure called STRESS1 that is given by

$$ \text{STRESS1} = \left( \frac{\sum_{i<j} (d_{ij} - \hat{d}_{ij})^2}{\sum_{i<j} d_{ij}^2} \right)^{1/2}. \tag{17.21} $$

Fig. 17.8 Scatterplot of dissimilarities against distances (panel title: Dissimilarities and Distances) Q MVAnmdscar2

Table 17.6 STRESS calculations for car marks example

(i,j)   \delta_{ij}   d_{ij}   \hat{d}_{ij}   (d_{ij} - \hat{d}_{ij})^2   d_{ij}^2   (d_{ij} - \bar{d})^2
(2,3)   1             4.1      3.15           0.9                         16.8       3.8
(1,3)   2             2.2      3.15           0.9                         4.8        14.8
(1,2)   3             5.1      5.1            0                           26.0       0.9
(2,4)   4             8.5      7.9            0.4                         72.3       6.0
(1,4)   5             7.3      7.9            0.4                         53.3       1.6
(3,4)   6             9.1      9.1            0                           82.8       9.3
Sum                   36.3                    2.6                         256.0      36.4

An alternative stress measure is given by

$$ \text{STRESS2} = \left( \frac{\sum_{i<j} (d_{ij} - \hat{d}_{ij})^2}{\sum_{i<j} (d_{ij} - \bar{d})^2} \right)^{1/2}, \tag{17.22} $$

where $\bar{d}$ denotes the average distance.

Example 17.4 Table 17.6 presents the STRESS calculations for the car example. The average distance is $\bar{d} = 36.3/6 \approx 6.1$. The corresponding STRESS measures are:

$$ \text{STRESS1} = \sqrt{2.6/256} = 0.1 $$
$$ \text{STRESS2} = \sqrt{2.6/36.4} = 0.27. $$
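Both measures are simple to compute once distances and disparities are available. The Python sketch below (with the function name `stress_measures` chosen for illustration) uses the distance and disparity columns of Table 17.6. Because the code works with the unrounded squared terms rather than the rounded table entries, its results differ slightly from the hand calculation in the third decimal place.

```python
import math


def stress_measures(distances, disparities):
    """Return (STRESS1, STRESS2) for paired lists of distances d_ij
    and disparities d_ij-hat over all pairs i < j."""
    misfit = sum((d - dh) ** 2 for d, dh in zip(distances, disparities))
    mean_distance = sum(distances) / len(distances)
    stress1 = math.sqrt(misfit / sum(d * d for d in distances))
    stress2 = math.sqrt(misfit / sum((d - mean_distance) ** 2 for d in distances))
    return stress1, stress2


# Columns d_ij and d_ij-hat of the STRESS table, in rank order of the dissimilarities.
d = [4.1, 2.2, 5.1, 8.5, 7.3, 9.1]
d_hat = [3.15, 3.15, 5.1, 7.9, 7.9, 9.1]
stress1, stress2 = stress_measures(d, d_hat)
print(round(stress1, 3), round(stress2, 3))
```

STRESS2 is always at least as large as STRESS1, since its denominator measures the spread of the distances around their mean rather than around zero.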
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The  goal  is  to  find  a  point  configuration  that  balances  the  effects  STRESS  and 
non-monotonicity. This is achieved by an iterative procedure. More precisely, one defines a new position of object $i$ relative to object $j$ by

$$ x_{il}^{NEW} = x_{il} + \alpha \left( 1 - \frac{\hat{d}_{ij}}{d_{ij}} \right) (x_{jl} - x_{il}), \quad l = 1, \dots, p^{*}. \tag{17.23} $$

Here $\alpha$ denotes the step width of the iteration.

By (17.23) the configuration of object $i$ is improved relative to object $j$. In order to obtain an overall improvement relative to all remaining points one uses:

$$ x_{il}^{NEW} = x_{il} + \frac{\alpha}{n - 1} \sum_{j=1, j \ne i}^{n} \left( 1 - \frac{\hat{d}_{ij}}{d_{ij}} \right) (x_{jl} - x_{il}), \quad l = 1, \dots, p^{*}. \tag{17.24} $$

The choice of step width $\alpha$ is crucial. Kruskal proposes a starting value of $\alpha = 0.2$. The iteration is continued by a numerical approximation procedure, such as steepest descent or the Newton-Raphson procedure.
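A single update of the form (17.24) can be sketched as follows in Python. The function name `update_point` is our own. The data are the initial coordinates of the four cars, with Jaguar taken as (2, 7), the value consistent with the distances of Table 17.5, together with the distances and PAV disparities of the car example and the step width $\alpha = 3$ used in Example 17.5.

```python
def update_point(i, coords, distances, disparities, alpha):
    """Move object i relative to all other objects:
    x_il + alpha / (n - 1) * sum_j (1 - dhat_ij / d_ij) * (x_jl - x_il)."""
    n = len(coords)
    new_position = []
    for l in range(len(coords[i])):
        shift = sum((1 - disparities[i][j] / distances[i][j])
                    * (coords[j][l] - coords[i][l])
                    for j in range(n) if j != i)
        new_position.append(coords[i][l] + alpha / (n - 1) * shift)
    return new_position


# Order of objects: Mercedes, Jaguar, Ferrari, VW.
coords = [(3, 2), (2, 7), (1, 3), (10, 4)]
D = [[0, 5.1, 2.2, 7.3],
     [5.1, 0, 4.1, 8.5],
     [2.2, 4.1, 0, 9.1],
     [7.3, 8.5, 9.1, 0]]
D_hat = [[0, 5.1, 3.15, 7.9],
         [5.1, 0, 3.15, 7.9],
         [3.15, 3.15, 0, 9.1],
         [7.9, 7.9, 9.1, 0]]
ferrari_new = update_point(2, coords, D, D_hat, alpha=3)
print([round(x, 2) for x in ferrari_new])
```

A pair whose distance already equals its disparity, such as Ferrari and VW here, contributes nothing to the shift.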

In  a  fourth  step,  the  evaluation  phase,  the  STRESS  measure  is  used  to  evaluate 
whether  or  not  its  change  as  a  result  of  the  last  iteration  is  sufficiently  small  that 
the  procedure  is  terminated.  At  this  stage  the  optimal  fit  has  been  obtained  for  a 
given  dimension.  Hence,  the  whole  procedure  needs  to  be  carried  out  for  several 
dimensions. 

Example 17.5 Let us compute the new point configuration for $i = 3$ (Ferrari) (Fig. 17.9). The initial coordinates from Table 17.4 are

$$ x_{31} = 1 \quad \text{and} \quad x_{32} = 3. $$

Applying (17.24) yields (for $\alpha = 3$):

$$ x_{31}^{NEW} = 1 + \frac{3}{4 - 1} \sum_{j=1, j \ne 3}^{4} \left( 1 - \frac{\hat{d}_{3j}}{d_{3j}} \right) (x_{j1} - 1) $$
$$ = 1 + \left( 1 - \frac{3.15}{2.2} \right) (3 - 1) + \left( 1 - \frac{3.15}{4.1} \right) (2 - 1) + \left( 1 - \frac{9.1}{9.1} \right) (10 - 1) $$
$$ = 1 - 0.86 + 0.23 + 0 = 0.37. $$

Similarly we obtain $x_{32}^{NEW} = 4.36$.

Fig. 17.9 First iteration for Ferrari using Shepard-Kruskal algorithm (panel title: First Iteration for Ferrari) Q MVAnmdscar3

To find the appropriate number of dimensions, $p^{*}$, a plot of the minimum STRESS value as a function of the dimensionality is made. One possible criterion in selecting the appropriate dimensionality is to look for an elbow in the plot. A rule of thumb that can be used to decide if a STRESS value is sufficiently small or not is provided by Kruskal:

$$ S > 20\,\%, \text{ poor}; \quad S = 10\,\%, \text{ fair}; \quad S < 5\,\%, \text{ good}; \quad S = 0, \text{ perfect}. \tag{17.25} $$

Summary 

/  A 

Nonmetric  MDS  is  only  based  on  the  rank  order  of  dissimilarities. 

The  object  of  nonmetric  MDS  is  to  create  a  spatial  representation 
of  the  objects  with  low  dimensionality. 

A  practical  algorithm  is  given  as: 


1.  Choose  an  initial  configuration. 

2. Find $d_{ij}$ from the configuration.

3. Fit $\hat d_{ij}$, the disparities, by the PAV algorithm.

4. Find a new configuration $X_{n+1}$ by using the steepest descent.

5.  Go  to  2. 
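
The steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MVAnmdscar3 quantlet: the function names `pav` and `update_point` are hypothetical, the ordering of the distances by dissimilarity rank is assumed, and the coordinates of the other three cars are assumptions chosen to agree with the arithmetic of Example 17.5.

```python
def pav(values):
    """Pool-adjacent-violators: least squares non-decreasing fit to values."""
    blocks = []  # each block holds [sum, count] of the pooled neighbours
    for v in values:
        blocks.append([v, 1])
        # Pool backwards while the previous block mean exceeds the last one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


def update_point(x_i, others, dist, disp, alpha):
    """One update of the point x_i, following the form of (17.24)."""
    n = len(others) + 1
    new = list(x_i)
    for x_j, d, d_hat in zip(others, dist, disp):
        for l in range(len(x_i)):
            new[l] += alpha / (n - 1) * (1 - d_hat / d) * (x_j[l] - x_i[l])
    return new


# Rounded distances, listed by increasing dissimilarity rank (assumed order).
distances = [4.1, 2.2, 5.1, 8.5, 7.3, 9.1]
disparities = pav(distances)  # the first two and the middle pair get pooled

# Ferrari (i = 3) against Mercedes, Jaguar and VW (assumed coordinates).
ferrari = (1, 3)
others = [(3, 2), (2, 7), (10, 4)]
d_3j = [2.2, 4.1, 9.1]        # distances from Ferrari
d_hat_3j = [3.15, 3.15, 9.1]  # matching disparities after PAV
ferrari_new = update_point(ferrari, others, d_3j, d_hat_3j, alpha=3)
print([round(v, 2) for v in ferrari_new])  # prints [0.37, 4.36]
```

The rounded distances are used on purpose so that the result agrees with the two-decimal arithmetic of the worked example; with unrounded distances the new coordinates differ slightly in the second decimal.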
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17.4  Exercises 

Exercise  17.1  Apply  the  MDS  method  to  the  Swiss  bank  note  data.  What  do  you 
expect  to  see  ? 

Exercise  17.2  Using  (17.6),  show  that  (17.7)  can  be  written  in  the  form  (17.2). 

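Exercise 17.3 Show that

1. $b_{ii} = a_{\bullet\bullet} - 2a_{i\bullet}$, $b_{ij} = a_{ij} - a_{i\bullet} - a_{\bullet j} + a_{\bullet\bullet}$; $i \ne j$

2. $\mathcal{B} = \sum_{i=1}^{p} x_i x_i^\top$

3. $\sum_{i=1}^{n} \lambda_i = \sum_{i=1}^{n} b_{ii} = \frac{1}{2n} \sum_{i,j=1}^{n} d_{ij}^2$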

Exercise 17.4 Redo a careful analysis of the car marks data based on the following
dissimilarity matrix:

          j      1        2      3      4
  i              Nissan   Kia    BMW    Audi
  1   Nissan     -
  2   Kia        2        -
  3   BMW        4        6      -
  4   Audi       3        5      1      -

Exercise  17.5  Apply  the  MDS  method  to  the  US  health  data.  Is  the  result  in 
accordance  with  the  geographic  location  of  the  US  states? 

Exercise  17.6  Redo  Exercise  17.5  with  the  US  crime  data. 

Exercise  17.7  Perform  the  MDS  analysis  on  the  Athletic  Records  data  in 
Sect.  22.18.  Can  you  see  which  countries  are  “close  to  each  other  ”? 


Chapter  18 

Conjoint  Measurement  Analysis 


Conjoint  Measurement  Analysis  plays  an  important  role  in  marketing.  In  the  design 
of  new  products  it  is  valuable  to  know  which  components  carry  what  kind  of  utility 
for  the  customer.  Marketing  and  advertisement  strategies  are  based  on  the  perception 
of  the  new  product’s  overall  utility.  It  can  be  valuable  information  for  a  car  producer 
to  know  whether  a  change  in  sportiness  or  a  change  in  safety  or  comfort  equipment  is 
perceived  as  a  higher  increase  in  overall  utility.  The  Conjoint  Measurement  Analysis 
is  a  method  for  attributing  utilities  to  the  components  (part  worths)  on  the  basis  of 
ranks  given  to  different  outcomes  (stimuli)  of  the  product.  An  important  assumption 
is  that  the  overall  utility  is  decomposed  as  a  sum  of  the  utilities  of  the  components. 

In  Sect.  18.1  we  introduce  the  idea  of  Conjoint  Measurement  Analysis.  We  give 
two  examples  from  the  food  and  car  industries.  In  Sect.  18.2  we  shed  light  on  the 
problem  of  designing  questionnaires  for  ranking  different  product  outcomes.  In 
Sect. 18.3 we see that the metric solution of estimating the part-worths is given
by  solving  a  least  squares  problem.  The  estimated  preference  ordering  may  be 
nonmonotone.  The  nonmetric  solution  strategy  takes  care  of  this  inconsistency  by 
iterating  between  a  least  squares  solution  and  the  pool  adjacent  violators  algorithm. 


18.1  Introduction 

In the design and perception of new products it is important to specify the
contributions made by different facets or elements. The overall utility and acceptance
of such a new product can then be estimated and understood as a possibly additive
function of the elementary utilities. Examples are the design of cars, a food article
or the program of a political party. For a new type of margarine one may ask
whether a change in taste or presentation will enhance the overall perception of
the product. The elementary utilities are here the presentation style and the taste
(e.g. calorie content). For a party program one may want to investigate whether a
stronger ecological or a stronger social orientation gives a better overall profile of
the party. For the marketing of a new car one may be interested in whether this new
car should have stronger active safety or comfort equipment, a more sporty note,
or a combination of both.

In  Conjoint  Measurement  Analysis  one  assumes  that  the  overall  utility  can  be 
explained  as  an  additive  decomposition  of  the  utilities  of  different  elements.  In  a 
sample  of  questionnaires  people  ranked  the  product  types  and  thus  revealed  their 
preference  orderings.  The  aim  is  to  find  the  decomposition  of  the  overall  utility  on 
the  basis  of  observed  data  and  to  interpret  the  elementary  or  marginal  utilities. 

Example  18.1  A  car  producer  plans  to  introduce  a  new  car  with  features  that  appeal 
to  the  customer  and  that  may  help  in  promoting  future  sales.  The  new  elements 
that  are  considered  are  comfort/safety  components  (e.g.  active  steering  or  GPS)  and 
a  sporty  look  (leather  steering  wheel  and  additional  kW  of  the  engine).  The  car 
producer  has  thus  four  lines  of  cars. 


car 1:  basic safety equipment  and  low sportiness
car 2:  basic safety equipment  and  high sportiness
car 3:  high safety equipment   and  low sportiness
car 4:  high safety equipment   and  high sportiness

For  the  car  producer  it  is  important  to  rank  these  cars  and  to  find  out  customers’ 
attitudes  toward  a  certain  product  line  in  order  to  develop  a  suitable  marketing 
scheme.  A  tester  may  rank  the  cars  as  described  in  Table  18.1. 

The  elementary  utilities  here  are  the  comfort  equipment  and  the  level  of 
sportiness.  Conjoint  Measurement  Analysis  aims  at  explaining  the  rank  order  given 
by  the  test  person  as  a  function  of  these  elementary  utilities. 

Example  18.2  A  food  producer  plans  to  create  a  new  margarine  and  varies  the 
product  characteristics  “calories”  (low  vs.  high)  and  “presentation”  (a  plastic  pot 
vs.  paper  package)  (Backhaus,  Erichson,  Plinke,  &  Weiber,  1996).  We  can  view  this 
in  fact  as  ranking  four  products. 


product 1:  low calories   and  plastic pot
product 2:  low calories   and  paper package
product 3:  high calories  and  plastic pot
product 4:  high calories  and  paper package

Table 18.1 Tester's ranking of cars

  Car       1   2   3   4
  Ranking   1   2   4   3

Table 18.2 Tester's ranking of margarine

  Product   1   2   3   4
  Ranking   3   4   1   2

These four fictive products may now be ordered by a set of sample testers as
described in Table 18.2.

The Conjoint Measurement Analysis aims to explain such a preference ranking
by attributing part-worths to the different elements of the product. The part-worths
are the utilities of the elementary components of the product.

In interpreting the part-worths one may find that for a test person one of the
elements has a higher value or utility. This may lead to a new design or to the
decision that this utility should be emphasised in advertisement schemes.


Summary

Conjoint Measurement Analysis is used in the design of new
products.

Conjoint Measurement Analysis tries to identify part-worth utilities
that contribute to an overall utility.

The part-worths enter additively into an overall utility.

The interpretation of the part-worths gives insight into the
perception and acceptance of the product.


18.2  Design  of  Data  Generation 

The  product  is  defined  through  the  properties  of  the  components.  A  stimulus  is 
defined  as  a  combination  of  the  different  components.  Examples  18.1  and  18.2  had 
four  stimuli  each.  In  the  margarine  example  they  were  the  possible  combinations  of 
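the factors $X_1$ (calories) and $X_2$ (presentation). If a product property such as

$$X_3 \text{ (usage)} = \begin{cases} 1 & \text{bread} \\ 2 & \text{cooking} \\ 3 & \text{universal} \end{cases}$$

is added, then there are $3 \cdot 2 \cdot 2 = 12$ stimuli.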
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For  the  automobile  Example  18.1  additional  characteristics  may  be  engine  power 
and  the  number  of  doors.  Suppose  that  the  engines  offered  for  the  new  car  have 
50,  70,  90 kW  and  that  the  car  may  be  produced  in  2-,  4-,  or  5-door  versions.  These 
categories  may  be  coded  as 


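$$X_3 \text{ (power of engine)} = \begin{cases} 1 & 50\ \text{kW} \\ 2 & 70\ \text{kW} \\ 3 & 90\ \text{kW} \end{cases}$$

and

$$X_4 \text{ (doors)} = \begin{cases} 1 & 2\ \text{doors} \\ 2 & 4\ \text{doors} \\ 3 & 5\ \text{doors.} \end{cases}$$

Both $X_3$ and $X_4$ have three factor levels each, whereas the first two factors $X_1$
(safety) and $X_2$ (sportiness) have only two levels. Altogether $2 \cdot 2 \cdot 3 \cdot 3 = 36$ stimuli
are possible. In a questionnaire a tester would have to rank all 36 different products.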

The profile method asks for the utility of each stimulus. This may be time
consuming and tiring for a test person if there are too many factors and factor levels.
Suppose that there are six properties of components with three levels each. This
results in $3^6 = 729$ stimuli (i.e. 729 different products) that a tester would have to
rank.
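
The growth in the number of stimuli can be checked by enumerating all level combinations. The following is a minimal sketch; the helper name `stimuli` is hypothetical, and each factor level is simply coded as 1, 2, ... as in the text.

```python
from itertools import product
from math import prod


def stimuli(levels):
    """List every stimulus as a tuple of factor levels coded 1, 2, ..."""
    return list(product(*(range(1, n + 1) for n in levels)))


# Car example: X1 (safety) and X2 (sportiness) with 2 levels each,
# X3 (engine power) and X4 (doors) with 3 levels each.
car_stimuli = stimuli([2, 2, 3, 3])
print(len(car_stimuli))   # prints 36
print(car_stimuli[0])     # prints (1, 1, 1, 1)

# Six properties with three levels each.
print(prod([3] * 6))      # prints 729
```

Under the profile method every tuple in such a list is one product that the tester has to rank, which is why the questionnaire length grows multiplicatively with each added factor.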

The  two  factor  method  is  a  simplification  and  considers  only  two  factors 
simultaneously.  It  is  also  called  trade-off  analysis.  The  idea  is  to  present  just  two 
stimuli  at  a  time  and  then  to  recombine  the  information.  Trade-off  analysis  is 
performed  by  defining  the  trade-off  matrices  corresponding  to  stimuli  of  two  factors 
only. 

The trade-off matrices for the levels $X_1$, $X_2$ and $X_3$ from the margarine Example
18.2 are given in Table 18.3. The trade-off matrices for the new car outfit are
described in Table 18.4.

The  choice  between  the  profile  method  and  the  trade-off  analysis  should  be 
guided  by  consideration  of  the  following  aspects: 

1 .  requirements  on  the  test  person, 

2.  time  consumption, 

3.  product  perception. 


Table 18.3 Trade-off matrices for margarine

  X3 | X1           X3 | X2
  ---+------        ---+------
   1 | 1  2          1 | 1  2
   2 | 1  2          2 | 1  2
   3 | 1  2          3 | 1  2

Table 18.4 Trade-off matrices for car design

  X4 | X3              X4 | X2           X4 | X1
  ---+---------        ---+------        ---+------
   1 | 1  2  3          1 | 1  2          1 | 1  2
   2 | 1  2  3          2 | 1  2          2 | 1  2
   3 | 1  2  3          3 | 1  2          3 | 1  2

  X3 | X2           X3 | X1
  ---+------        ---+------
   1 | 1  2          1 | 1  2
   2 | 1  2          2 | 1  2
   3 | 1  2          3 | 1  2


The  first  aspect  relates  to  the  ability  of  the  test  person  to  judge  the  different  stimuli. 
It  is  certainly  an  advantage  of  the  trade-off  analysis  that  one  only  has  to  consider 
two  factors  simultaneously.  The  two  factor  method  can  be  carried  out  more  easily 
in  a  questionnaire  without  an  interview. 

The profile method incorporates the possibility of a complete product perception
since the test person is not confronted with an isolated aspect (2 factors) of the
product. The stimuli may be presented visually in their final form (e.g. as a picture).
With the profile method, however, the number of stimuli grows multiplicatively with
the number of properties and their levels. The time to complete a questionnaire is
therefore a factor in the choice of method.

In general, product perception is the most important aspect, and it is therefore
the profile method that is used most often. The time consumption aspect speaks for the
trade-off analysis. There exist, however, clever strategies for selecting representative
subsets of all profiles that bound the time investment. We therefore concentrate on
the profile method in the following.


Summary

A stimulus is a combination of different properties of a product.

Conjoint measurement analysis is based either on a list of all factors
(profile method) or on trade-off matrices (two factor method).

Trade-off matrices are used if there are too many factor levels.

Presentation of trade-off matrices makes it easier for testers since
only two stimuli have to be ranked at a time.
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18.3  Estimation  of  Preference  Orderings 

On the basis of the reported preference values for each stimulus conjoint analysis
determines the part-worths. Conjoint analysis uses an additive model of the form

$$Y_k = \sum_{j=1}^{J} \sum_{l=1}^{L_j} \beta_{jl}\, I(X_j = x_{jl}) + \mu, \quad \text{for } k = 1, \dots, K \text{ and } \forall j: \sum_{l=1}^{L_j} \beta_{jl} = 0. \tag{18.1}$$

$X_j$ ($j = 1, \dots, J$) denote the factors, $x_{jl}$ ($l = 1, \dots, L_j$) are the levels of each
factor $X_j$ and the coefficients $\beta_{jl}$ are the part-worths. The constant $\mu$ denotes an
overall level and $Y_k$ is the observed preference for each stimulus. The total
number of stimuli is:

$$K = \prod_{j=1}^{J} L_j.$$

Equation (18.1) is without an error term for the moment. In order to explain
how (18.1) may be written in the standard linear model form we first concentrate on
$J = 2$ factors. Suppose that the factors engine power and airbag safety equipment
have been ranked as follows:


                      Airbag
                      1      2
  Engine  50 kW  1    1      3
          70 kW  2    2      6
          90 kW  3    4      5

There are $K = 6$ preferences altogether. Suppose that the stimuli have been
sorted so that $Y_1$ corresponds to engine level 1 and airbag level 1, $Y_2$ corresponds to
engine level 1 and airbag level 2, and so on. Then model (18.1) reads:

$$
\begin{aligned}
Y_1 &= \beta_{11} + \beta_{21} + \mu \\
Y_2 &= \beta_{11} + \beta_{22} + \mu \\
Y_3 &= \beta_{12} + \beta_{21} + \mu \\
Y_4 &= \beta_{12} + \beta_{22} + \mu \\
Y_5 &= \beta_{13} + \beta_{21} + \mu \\
Y_6 &= \beta_{13} + \beta_{22} + \mu.
\end{aligned}
$$

Now we would like to estimate the part-worths $\beta_{jl}$.
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Table  18.5  Ranked  products 


                            X2 (calories)
                            Low    High
                            1      2
  X1 (usage)  Bread      1  2      1
              Cooking    2  3      4
              Universal  3  6      5

Example 18.3 In the margarine example let us consider the part-worths of $X_1$ =
usage and $X_2$ = calories. We have $x_{11} = 1$, $x_{12} = 2$, $x_{13} = 3$, $x_{21} = 1$ and
$x_{22} = 2$. (We momentarily re-labeled the factors: $X_3$ became $X_1$.) Hence $L_1 = 3$
and $L_2 = 2$. Suppose that a person has ranked the six different products as in
Table 18.5.

If we order the stimuli as follows:

$$
\begin{aligned}
Y_1 &= \text{Utility}(X_1 = 1 \wedge X_2 = 1) \\
Y_2 &= \text{Utility}(X_1 = 1 \wedge X_2 = 2) \\
Y_3 &= \text{Utility}(X_1 = 2 \wedge X_2 = 1) \\
Y_4 &= \text{Utility}(X_1 = 2 \wedge X_2 = 2) \\
Y_5 &= \text{Utility}(X_1 = 3 \wedge X_2 = 1) \\
Y_6 &= \text{Utility}(X_1 = 3 \wedge X_2 = 2),
\end{aligned}
$$

we obtain from Eq. (18.1) the same decomposition as above:

$$
\begin{aligned}
Y_1 &= \beta_{11} + \beta_{21} + \mu \\
Y_2 &= \beta_{11} + \beta_{22} + \mu \\
Y_3 &= \beta_{12} + \beta_{21} + \mu \\
Y_4 &= \beta_{12} + \beta_{22} + \mu \\
Y_5 &= \beta_{13} + \beta_{21} + \mu \\
Y_6 &= \beta_{13} + \beta_{22} + \mu.
\end{aligned}
$$

Our aim is to estimate the part-worths $\beta_{jl}$ as well as possible from a collection of
tables like Table 18.5 that have been generated by a sample of test persons. First, the
so-called metric solution to this problem is discussed and then a non-metric solution.


Metric  Solution 

The problem of conjoint measurement analysis can be solved by the technique of
Analysis of Variance (ANOVA). An important assumption underlying this technique
is that the "distance" between any two adjacent preference orderings corresponds to
the same difference in utility. That is, the difference in utility between the products
ranked 1st and 2nd is the same as the difference in utility between the products
ranked 4th and 5th. Put differently, we treat the ranking of the products (which is an
ordinal variable) as if it were a metric variable.

Introducing a mean utility $\mu$, Eq. (18.1) can be rewritten. The mean utility in the
above Example 18.3 is $\mu = (1 + 2 + 3 + 4 + 5 + 6)/6 = 21/6 = 3.5$. In order
to check the deviations of the utilities from this mean, we enlarge Table 18.5 by the
mean utility $\bar p_{x_{jl}}$, given a certain level of the other factor. The metric solution for
the car example is given in Table 18.6.

Table 18.6 Metric solution for car example

                        X2 (airbags)
                        1       2       $\bar p_{x_{1l}}$   $\hat\beta_{1l}$
  X1 (engine)  50 kW 1  1       3       2       -1.5
               70 kW 2  2       6       4        0.5
               90 kW 3  4       5       4.5      1
  $\bar p_{x_{2l}}$               2.33    4.66    3.5
  $\hat\beta_{2l}$               -1.16    1.16

Example 18.4 In the margarine example the resulting part-worths for $\mu = 3.5$ are

$$
\begin{aligned}
\hat\beta_{11} &= -2 & \hat\beta_{21} &= 0.16 \\
\hat\beta_{12} &= 0 & \hat\beta_{22} &= -0.16. \\
\hat\beta_{13} &= 2
\end{aligned}
$$

Note that $\sum_{l=1}^{L_j} \hat\beta_{jl} = 0$ ($j = 1, \dots, J$). The estimated utility $\hat Y_1$ for the product with
low calories and usage of bread, for example, is:

$$\hat Y_1 = \hat\beta_{11} + \hat\beta_{21} + \hat\mu = -2 + 0.16 + 3.5 = 1.66.$$

The estimated utility $\hat Y_4$ for product 4 (cooking ($X_1 = 2$) and high calories ($X_2 = 2$)) is:

$$\hat Y_4 = \hat\beta_{12} + \hat\beta_{22} + \hat\mu = 0 - 0.16 + 3.5 = 3.33.$$

The coefficients $\hat\beta_{jl}$ are computed as $\bar p_{x_{jl}} - \mu$, where $\bar p_{x_{jl}}$ is the average preference
ordering for each factor level. For instance, $\bar p_{x_{11}} = 1/2 \cdot (2 + 1) = 1.5$.

The fit can be evaluated by calculating the deviations of the fitted values to the
observed preference orderings. In the rightmost column of Table 18.8 the quadratic
deviations between the observed rankings (utilities) $Y_k$ and the estimated utilities $\hat Y_k$
are listed.
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Table  18.7  Metric  solution 
for  Table  18.5 


                            X2 (calories)
                            Low     High
                            1       2       $\bar p_{x_{1l}}$   $\hat\beta_{1l}$
  X1 (usage)  Bread      1  2       1       1.5     -2
              Cooking    2  3       4       3.5      0
              Universal  3  6       5       5.5      2
  $\bar p_{x_{2l}}$                   3.66    3.33    3.5
  $\hat\beta_{2l}$                    0.16   -0.16

Table 18.8 Deviations between model and data

  Stimulus   $Y_k$   $\hat Y_k$   $Y_k - \hat Y_k$   $(Y_k - \hat Y_k)^2$
  1          2       1.66          0.33              0.11
  2          1       1.33         -0.33              0.11
  3          3       3.66         -0.66              0.44
  4          4       3.33          0.66              0.44
  5          6       5.66          0.33              0.11
  6          5       5.33         -0.33              0.11
  $\sum$     21      21            0                 1.33
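
The metric solution for the margarine rankings of Table 18.5 can be retraced numerically. The sketch below is an illustration only; the function names `metric_part_worths` and `fitted_values` are hypothetical, and the rankings are entered row by row (usage level by calorie level) so that the stimuli follow the ordering $Y_1, \dots, Y_6$ used in the text.

```python
def metric_part_worths(table):
    """Part-worths as level means minus the overall mean.

    table[i][j] is the ranking for level i + 1 of X1 and level j + 1 of X2.
    """
    rows, cols = len(table), len(table[0])
    mu = sum(sum(row) for row in table) / (rows * cols)
    beta1 = [sum(row) / cols - mu for row in table]
    beta2 = [sum(row[j] for row in table) / rows - mu for j in range(cols)]
    return mu, beta1, beta2


def fitted_values(mu, beta1, beta2):
    """Additive fit beta_1l + beta_2m + mu, listed in stimulus order."""
    return [b1 + b2 + mu for b1 in beta1 for b2 in beta2]


table_18_5 = [[2, 1], [3, 4], [6, 5]]  # bread, cooking, universal
mu, beta1, beta2 = metric_part_worths(table_18_5)
y = [value for row in table_18_5 for value in row]
y_hat = fitted_values(mu, beta1, beta2)
sum_sq = sum((a - b) ** 2 for a, b in zip(y, y_hat))

print(mu, beta1)                      # prints 3.5 [-2.0, 0.0, 2.0]
print([round(b, 2) for b in beta2])   # prints [0.17, -0.17]
print(round(sum_sq, 2))               # prints 1.33
```

The tables report 0.16 and 3.66 where the code prints 0.17 and would give 3.67, because the printed tables truncate rather than round the repeating decimals.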

The technique described that generated Table 18.7 is in fact the solution to a least
squares problem. The conjoint measurement problem (18.1) may be rewritten as a
linear regression model (with error $\varepsilon = 0$):

$$Y = \mathcal{X}\beta + \varepsilon \tag{18.2}$$

with $\mathcal{X}$ being a design matrix with dummy variables. $\mathcal{X}$ has the row dimension
$K = \prod_{j=1}^{J} L_j$ (the number of stimuli) and the column dimension $D = \sum_{j=1}^{J} L_j - J$
(one further column of ones is needed when the overall level is included, as in
the design matrix (18.4)). The reason for the reduced column number is that per
factor only $(L_j - 1)$ vectors are linearly independent. Without loss of generality
we may standardise the problem so that the last coefficient of each factor is omitted.
The error term $\varepsilon$ is introduced since even for one person the preference orderings
may not fit the model (18.1).

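Example 18.5 If we rewrite the $\beta$ coefficients in the form

$$
\begin{pmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \end{pmatrix}
=
\begin{pmatrix} \mu + \beta_{13} + \beta_{22} \\ \beta_{11} - \beta_{13} \\ \beta_{12} - \beta_{13} \\ \beta_{21} - \beta_{22} \end{pmatrix}
\tag{18.3}
$$

and define the design matrix $\mathcal{X}$ as

$$
\mathcal{X} =
\begin{pmatrix}
1 & 1 & 0 & 1 \\
1 & 1 & 0 & 0 \\
1 & 0 & 1 & 1 \\
1 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 \\
1 & 0 & 0 & 0
\end{pmatrix}
\tag{18.4}
$$

then Eq. (18.1) leads to the linear model (with error $\varepsilon = 0$):

$$Y = \mathcal{X}\beta + \varepsilon. \tag{18.5}$$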


The  least  squares  solution  to  this  problem  is  the  technique  used  for  Table  18.7. 
In  practice  we  have  more  than  one  person  to  answer  the  utility  rank  question  for 
the  different  factor  levels.  The  design  matrix  is  then  obtained  by  stacking  the  above 
design matrix n times. Hence, for n persons we have as a final design matrix:

$$
\mathcal{X}^* = 1_n \otimes \mathcal{X} =
\left.
\begin{pmatrix} \mathcal{X} \\ \vdots \\ \mathcal{X} \end{pmatrix}
\right\} n\text{-times}
$$

which has dimension $(nK)(L - J)$ (where $L = \sum_{j=1}^{J} L_j$) and $Y^* = (Y_1^\top, \dots, Y_n^\top)^\top$.
The linear model (18.5) can now be written as:

$$Y^* = \mathcal{X}^*\beta + \varepsilon^*. \tag{18.6}$$

Given that the test people assign different rankings, the error term $\varepsilon^*$ is a necessary
part of the model.

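Example 18.6 If we take the $\beta$ vector as defined in (18.3) and the design matrix $\mathcal{X}$
from (18.4), we obtain the coefficients:

$$
\begin{aligned}
\hat\beta_1 &= 5.33 = \hat\mu + \hat\beta_{13} + \hat\beta_{22} \\
\hat\beta_2 &= -4 = \hat\beta_{11} - \hat\beta_{13} \\
\hat\beta_3 &= -2 = \hat\beta_{12} - \hat\beta_{13} \\
\hat\beta_4 &= 0.33 = \hat\beta_{21} - \hat\beta_{22} \\
\sum_{l=1}^{L_j} \hat\beta_{jl} &= 0.
\end{aligned}
\tag{18.7}
$$

Solving (18.7) we have:

$$
\begin{aligned}
\hat\beta_{11} &= \hat\beta_2 - \tfrac{1}{3}\left(\hat\beta_2 + \hat\beta_3\right) = -2 \\
\hat\beta_{12} &= \hat\beta_3 - \tfrac{1}{3}\left(\hat\beta_2 + \hat\beta_3\right) = 0 \\
\hat\beta_{13} &= -\tfrac{1}{3}\left(\hat\beta_2 + \hat\beta_3\right) = 2 \\
\hat\beta_{21} &= \hat\beta_4 - \tfrac{1}{2}\hat\beta_4 = \tfrac{1}{2}\hat\beta_4 = 0.16 \\
\hat\beta_{22} &= -\tfrac{1}{2}\hat\beta_4 = -0.16 \\
\hat\mu &= \hat\beta_1 + \tfrac{1}{3}\left(\hat\beta_2 + \hat\beta_3\right) + \tfrac{1}{2}\hat\beta_4 = 3.5.
\end{aligned}
\tag{18.8}
$$

In fact, we obtain the same estimated part-worths as in Table 18.7. The stimulus
$k = 2$ corresponds to adding up $\hat\beta_{11}$, $\hat\beta_{22}$, and $\hat\mu$ (see (18.3)). Adding $\hat\beta_1$ and
$\hat\beta_2$ gives:

$$\hat Y_2 = 5.33 - 4 = 1.33.$$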
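
The least squares fit behind these coefficients can be retraced without any numerical library by solving the normal equations $\mathcal{X}^\top \mathcal{X} \beta = \mathcal{X}^\top Y$ directly. The sketch below is an illustration under assumptions: the helper names `solve` and `least_squares` are hypothetical, and the stacked case simply repeats the single ranking of Table 18.5 for a second, purely hypothetical person.

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest entry of the column up.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def least_squares(x, y):
    """Least squares coefficients from the normal equations."""
    cols = range(len(x[0]))
    xtx = [[sum(row[i] * row[j] for row in x) for j in cols] for i in cols]
    xty = [sum(row[i] * yk for row, yk in zip(x, y)) for i in cols]
    return solve(xtx, xty)


# Design matrix (18.4) and the rankings of Table 18.5 in stimulus order.
design = [
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
]
y = [2, 1, 3, 4, 6, 5]
beta = least_squares(design, y)
print([round(b, 2) for b in beta])   # prints [5.33, -4.0, -2.0, 0.33]

# Stacking the design n times, as in the Kronecker product construction.
n = 2
x_star = [row for _ in range(n) for row in design]
y_star = y * n  # hypothetical: the second person repeats the same ranking
beta_star = least_squares(x_star, y_star)
```

With identical rankings the stacked estimate coincides with the single-person estimate; differing rankings would leave a nonzero residual, which is the role of the error term in (18.6).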


Nonmetric  Solution 

If we drop the assumption that utilities are measured on a metric scale, we have
to use (18.1) to estimate the coefficients from an adjusted set of estimated utilities.
More precisely, we may use the monotone ANOVA as developed by Kruskal (1965).
The procedure works as follows. First, one estimates model (18.1) with the ANOVA
technique described above. Then one applies a monotone transformation $\hat Z = f(\hat Y)$
to the estimated stimulus utilities. The monotone transformation $f$ is used because
the fitted values $\hat Y_k$ from (18.2) of the reported preference orderings $Y_k$ may not be
monotone. The transformation $\hat Z_k = f(\hat Y_k)$ is introduced to guarantee monotonicity
of preference orderings. For the car example the reported $Y_k$ values were $Y =
(1, 3, 2, 6, 4, 5)^\top$. With the part-worths of Table 18.6 the estimated values are
computed as:

$$
\begin{aligned}
\hat Y_1 &= -1.5 - 1.16 + 3.5 = 0.84 \\
\hat Y_2 &= -1.5 + 1.16 + 3.5 = 3.16 \\
\hat Y_3 &= 0.5 - 1.16 + 3.5 = 2.84 \\
\hat Y_4 &= 0.5 + 1.16 + 3.5 = 5.16 \\
\hat Y_5 &= 1 - 1.16 + 3.5 = 3.34 \\
\hat Y_6 &= 1 + 1.16 + 3.5 = 5.66.
\end{aligned}
$$

If we make a plot of the estimated preference orderings against the revealed ones,
we obtain Fig. 18.1.
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Fig.  18.1  Plot  of  estimated 
preference orderings vs. revealed rankings and PAV fit MVAcarrankings

We see that the estimated $\hat Y_4 = 5.16$ is below the estimated $\hat Y_6 = 5.66$, although
stimulus 4 was ranked above stimulus 6, and thus an inconsistency in ranking the
utilities occurs. The monotone transformation $\hat Z_k = f(\hat Y_k)$ is introduced to make
the relationship in Fig. 18.1 monotone. A very simple procedure consists of averaging
the "violators" $\hat Y_4$ and $\hat Y_6$ to obtain 5.41. The relationship is then monotone but
the model (18.1) may now be violated. The idea is therefore to iterate these two
steps until the stress measure (see Chap. 17)

$$\text{STRESS} = \frac{\sum_{k=1}^{K} (\hat Z_k - \hat Y_k)^2}{\sum_{k=1}^{K} (\hat Y_k - \bar{\hat Y})^2} \tag{18.9}$$

is minimised over $\beta$ and the monotone transformation $f$. The monotone
transformation can be computed by the so-called pool-adjacent-violators (PAV) algorithm.
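
One pass of the monotone adjustment for the car example can be sketched as follows. This is an illustration only: the function name `pav` is hypothetical, the fitted values are the two-decimal figures quoted in the text, and the STRESS denominator is taken as the spread of the fitted values around their mean, which is an assumption about the exact form of (18.9).

```python
def pav(values):
    """Pool-adjacent-violators: least squares non-decreasing fit to values."""
    blocks = []  # each block holds [sum, count] of the pooled neighbours
    for v in values:
        blocks.append([v, 1])
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


y = [1, 3, 2, 6, 4, 5]                         # reported rankings
y_hat = [0.84, 3.16, 2.84, 5.16, 3.34, 5.66]   # metric fit, as quoted

# Sort the fitted values by the reported ranking, make them monotone,
# then map the result back to the original stimulus order.
order = sorted(range(len(y)), key=lambda k: y[k])
z_sorted = pav([y_hat[k] for k in order])
z = [0.0] * len(y)
for position, k in enumerate(order):
    z[k] = z_sorted[position]

mean_fit = sum(y_hat) / len(y_hat)
stress = sum((zk - yk) ** 2 for zk, yk in zip(z, y_hat)) / sum(
    (yk - mean_fit) ** 2 for yk in y_hat
)
print(round(z[3], 2), round(z[5], 2))  # prints 5.41 5.41
```

Only stimuli 4 and 6 are pooled; all other fitted values already respect the reported order and are left unchanged. In the full nonmetric procedure the part-worths would then be re-estimated from the adjusted values and the two steps repeated.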


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Summary 

The part-worths are estimated via the least squares method.

The metric solution corresponds to analysis of variance in a linear
model.

The non-metric solution iterates between a monotone regression
curve fitting and determining the part-worths by ANOVA
methodology.

The fitting of data to a monotone function is done via the PAV
algorithm.


18.4  Exercises 

Exercise 18.1 Compute the part-worths for the following table of rankings

              X2
              1    2
  X1    1     1    2
        2     4    3
        3     6    5

Exercise 18.2 Consider again Example 18.5. Rewrite the design matrix $\mathcal{X}$ and the
parameter vector $\beta$ so that the overall mean effect $\mu$ is part of $\mathcal{X}$ and $\beta$, i.e. find the
matrix $\mathcal{X}'$ and $\beta'$ such that $Y = \mathcal{X}'\beta'$.

Exercise 18.3 Compute the design matrix for Example 18.5 for $n = 3$ persons
ranking the margarine with $X_1$ and $X_2$.

Exercise  18.4  Construct  an  analog  for  Table  18.8  for  the  car  example. 

Exercise 18.5 Compute the part-worths on the basis of the following tables of
rankings observed on $n = 3$ persons.

            X2               X2               X2
            1   2            1   2            1   2
  X1   1    1   2       1    1   3       1    3   1
       2    4   3       2    4   2       2    5   2
       3    6   5       3    5   6       3    6   4
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Exercise  18.6  Suppose  that  in  the  car  example  a  person  has  ranked  cars  by  the 
profile  method  on  the  following  characteristics: 


X1 = motor
X2 = safety
X3 = doors

  X1  X2  X3  Preference      X1  X2  X3  Preference      X1  X2  X3  Preference
  1   1   1   1               2   1   1   7               3   1   1   13
  1   1   2   3               2   1   2   8               3   1   2   15
  1   1   3   2               2   1   3   9               3   1   3   14
  1   2   1   5               2   2   1   10              3   2   1   16
  1   2   2   4               2   2   2   12              3   2   2   17
  1   2   3   6               2   2   3   11              3   2   3   18

There are $K = 18$ stimuli.

Estimate and analyse the part-worths.


Chapter  19 

Applications  in  Finance 


A portfolio is a linear combination of assets. Each asset contributes with a weight
$c_j$ to the portfolio. The performance of such a portfolio is a function of the various
returns of the assets and of the weights $c = (c_1, \dots, c_p)^\top$. In this chapter we
investigate the "optimal choice" of the portfolio weights c. The optimality criterion
is the mean-variance efficiency of the portfolio. Usually investors are risk-averse,
therefore, we can define a mean-variance efficient portfolio to be a portfolio that
has a minimal variance for a given desired mean return. Equivalently, we could
try to optimise the weights for the portfolios with maximal mean return for a
given variance (risk structure). We develop this methodology in the situations of
(non)existence of riskless assets and discuss relations with the capital asset pricing
model (CAPM).


19.1  Portfolio  Choice 


Suppose that one has a portfolio of p assets. The price of asset j at time i is denoted
as $p_{ij}$. The return from asset j in a single time period (day, month, year etc.) is:

$$x_{ij} = \frac{p_{ij} - p_{i-1,j}}{p_{i-1,j}}.$$

We observe the vectors $x_i = (x_{i1}, \dots, x_{ip})^\top$ (i.e. the returns of the assets which are
contained in the portfolio) over several time periods. We stack these observations
into a data matrix $\mathcal{X} = (x_{ij})$ consisting of observations of a random variable

$$X \sim (\mu, \Sigma).$$
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The  return  of  the  portfolio  is  the  weighted  sum  of  the  returns  of  the  p  assets: 

$$Q = c^\top X, \tag{19.1}$$

where $c = (c_1, \dots, c_p)^\top$ (with $\sum_{j=1}^{p} c_j = 1$) denotes the proportions of the assets
in the portfolio. The mean return of the portfolio is given by the expected value
of Q, which is $c^\top \mu$. The risk or variance (squared volatility) of the portfolio is
given by the variance of Q (Theorem 4.6), which is equal to two times

$$\frac{1}{2}\, c^\top \Sigma c. \tag{19.2}$$

The reason for taking half of the variance of Q is merely technical. The optimisation
of (19.2) with respect to c is of course equivalent to minimising $c^\top \Sigma c$. Our aim is to
maximise the portfolio returns (19.1) given a bound on the volatility (19.2) or vice
versa to minimise risk given a (desired) mean return of the portfolio.
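
The quantities introduced so far are easy to compute once returns, weights and moments are available. The sketch below is an illustration only: the function names are hypothetical, and the prices, the mean vector `mu`, the covariance matrix `sigma` and the weights `c` are invented numbers, not the stock data of this chapter.

```python
def returns(prices):
    """Single-period returns (p_i - p_{i-1}) / p_{i-1} of one asset."""
    return [(p1 - p0) / p0 for p0, p1 in zip(prices, prices[1:])]


def portfolio_mean(c, mu):
    """Mean return of the portfolio: c' mu."""
    return sum(cj * mj for cj, mj in zip(c, mu))


def portfolio_variance(c, sigma):
    """Variance of the portfolio return: c' Sigma c."""
    p = len(c)
    return sum(c[i] * sigma[i][j] * c[j] for i in range(p) for j in range(p))


# Hypothetical inputs for two assets.
prices = [100.0, 110.0, 99.0]
mu = [0.01, 0.02]
sigma = [[0.04, 0.01], [0.01, 0.09]]
c = [0.9, 0.1]  # weights sum to one

asset_returns = returns(prices)
mean_q = portfolio_mean(c, mu)
var_q = portfolio_variance(c, sigma)
```

Half of `var_q` is the objective in (19.2); minimising it over `c` gives the same weights as minimising the variance itself.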


Summary

Given a matrix of returns $\mathcal{X}$ from p assets in n time periods, and
that the underlying distribution is stationary, i.e. $X \sim (\mu, \Sigma)$, then
the (theoretical) return of the portfolio is a weighted sum of the
returns of the p assets, namely $Q = c^\top X$.

The expected value of Q is $c^\top \mu$. For technical reasons one
considers optimising $\frac{1}{2} c^\top \Sigma c$. The risk or squared volatility is
$c^\top \Sigma c = \text{Var}(c^\top X)$.

The portfolio choice, i.e. the selection of c, is such that the return
is maximised for a given risk bound.


19.2  Efficient  Portfolio 

A variance efficient portfolio is one that keeps the risk (19.2) minimal under the
constraint that the weights sum to 1, i.e. $c^\top 1_p = 1$. For a variance efficient portfolio,
we therefore try to find the value of c that minimises the Lagrangian

$$\mathcal{L} = \frac{1}{2}\, c^\top \Sigma c - \lambda (c^\top 1_p - 1). \tag{19.3}$$

A mean-variance efficient portfolio is defined as one that has minimal variance
among all portfolios with the same mean. More formally, we have to find a vector
of weights c such that the variance of the portfolio is minimal subject to two
constraints:

1. a certain, pre-specified mean return $\bar\mu$ has to be achieved,

2. the weights have to sum to one.

Mathematically speaking, we are dealing with an optimisation problem under two
constraints.

The Lagrangian function for this problem is given by

$$\mathcal{L} = c^\top \Sigma c + \lambda_1 (\bar\mu - c^\top \mu) + \lambda_2 (1 - c^\top 1_p).$$

With tools presented in Sect. 2.4 we can calculate the first order condition for a
minimum:

$$\frac{\partial \mathcal{L}}{\partial c} = 2 \Sigma c - \lambda_1 \mu - \lambda_2 1_p = 0. \tag{19.4}$$

Example 19.1 Figure 19.1 shows the monthly returns from January 2000 to
December 2009 of six stocks. The data is from Yahoo Finance. For each stock
we have chosen the same scale on the vertical axis (which gives the return of
the stock). Note how the returns of some stocks, such as Forward Industries and
Apple, are much more volatile than the returns of other stocks, such as IBM or
Consolidated Edison (Electric utilities).

Fig. 19.1 Returns of six firms from January 2000 to December 2009 MVAreturns

As  a  very  simple  example  consider  two  differently  weighted  portfolios 
containing  only  two  assets,  IBM  and  Forward  Industries.  Figure  19.2  displays  the 
monthly  returns  of  the  two  portfolios.  The  portfolio  in  the  upper  panel  consists  of 
approximately  10%  Forward  Industries  assets  and  90%  IBM  assets.  The  portfolio 
in  the  lower  panel  contains  an  equal  proportion  of  each  of  the  assets.  The  text 
windows  on  the  right  of  Fig.  19.2  show  the  exact  weights  which  were  used.  We  can 
clearly  see  that  the  returns  of  the  portfolio  with  a  higher  share  of  the  IBM  assets 
(which  have  a  low  variance)  are  much  less  volatile. 

For  an  exact  analysis  of  the  optimisation  problem  (19.4)  we  distinguish  between 
two  cases:  the  existence  and  nonexistence  of  a  riskless  asset.  A  riskless  asset  is  an 
asset  such  as  a  zero  bond,  i.e.  a  financial  instrument  with  a  fixed  nonrandom  return 
(Franke,  Hardle  &  Hafner,  2011). 


[Fig. 19.2, panel titles and weights, horizontal axis X. Equally Weighted
Portfolio: 0.500 IBM, 0.500 Ford. Optimal Weighted Portfolio: 0.895 IBM,
0.105 Ford.]
Nonexistence of a Riskless Asset


Assume that the covariance matrix $\Sigma$ is invertible (which implies positive
definiteness). This is equivalent to the nonexistence of a portfolio $c$ with variance
$c^\top \Sigma c = 0$. If all assets are uncorrelated, $\Sigma$ is invertible if all of the asset returns
have positive variances. A riskless asset (uncorrelated with all other assets) would
have zero variance since it has fixed, nonrandom returns. In this case $\Sigma$ would not
be positive definite.

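The optimal weights can be derived from the first order condition (19.4) as

$$c = \frac{1}{2}\Sigma^{-1}(\lambda_1 \mu + \lambda_2 1_p). \tag{19.5}$$

Multiplying this from the left by the transpose of the $(p \times 1)$ vector $1_p$ of ones, we obtain

$$1 = 1_p^\top c = \frac{1}{2} 1_p^\top \Sigma^{-1}(\lambda_1 \mu + \lambda_2 1_p),$$

which can be solved for $\lambda_2$ to get:

$$\lambda_2 = \frac{2 - \lambda_1 1_p^\top \Sigma^{-1} \mu}{1_p^\top \Sigma^{-1} 1_p}.$$

Plugging this expression into (19.5) yields

$$c = \frac{1}{2}\lambda_1 \left( \Sigma^{-1}\mu - \frac{1_p^\top \Sigma^{-1} \mu}{1_p^\top \Sigma^{-1} 1_p}\, \Sigma^{-1} 1_p \right) + \frac{\Sigma^{-1} 1_p}{1_p^\top \Sigma^{-1} 1_p}. \tag{19.6}$$

For the case of a variance efficient portfolio there is no restriction on the mean of
the portfolio ($\lambda_1 = 0$). The optimal weights are therefore:

$$c = \frac{\Sigma^{-1} 1_p}{1_p^\top \Sigma^{-1} 1_p}. \tag{19.7}$$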


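This formula is identical to the solution of (19.3). Indeed, differentiation with
respect to $c$ gives

$$\Sigma c = \lambda 1_p, \qquad c = \lambda \Sigma^{-1} 1_p.$$

If we plug this into (19.3), we obtain

$$\mathcal{L} = \frac{1}{2}\lambda^2 1_p^\top \Sigma^{-1} 1_p - \lambda(\lambda 1_p^\top \Sigma^{-1} 1_p - 1) = \lambda - \frac{1}{2}\lambda^2 1_p^\top \Sigma^{-1} 1_p.$$

This quantity is a function of $\lambda$ whose derivative vanishes for

$$\lambda = (1_p^\top \Sigma^{-1} 1_p)^{-1}.$$

With this value the weights $c = \lambda \Sigma^{-1} 1_p$ sum to one and coincide with (19.7).
The solution is a minimum with respect to $c$ since

$$\frac{\partial^2 \mathcal{L}}{\partial c^\top \partial c} = \Sigma > 0.$$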


Theorem 19.1 The variance efficient portfolio weights for returns $X \sim (\mu, \Sigma)$ are

$$c_{opt} = \frac{\Sigma^{-1} 1_p}{1_p^\top \Sigma^{-1} 1_p}. \tag{19.8}$$


Existence  of  a  Riskless  Asset 

If an asset exists with variance equal to zero, then the covariance matrix $\Sigma$ is not
invertible. The notation can be adjusted for this case as follows: denote the return of
the riskless asset by $r$ (under the absence of arbitrage this is the interest rate), and
partition the vector and the covariance matrix of returns such that the last component
is the riskless asset. Thus, the last equation of the system (19.4) becomes

$$2\operatorname{Cov}(r, X) - \lambda_1 r - \lambda_2 = 0,$$

and, because the covariance of the riskless asset with any portfolio is zero, we have

$$\lambda_2 = -r\lambda_1. \tag{19.9}$$

Let us for a moment modify the notation in such a way that in each vector and matrix
the components corresponding to the riskless asset are excluded. For example, $c$
is the weight vector of the risky assets (i.e. assets with positive variance), and $c_0$
denotes the proportion invested in the riskless asset. Obviously, $c_0 = 1 - 1_p^\top c$, and $\Sigma$,
the covariance matrix of the risky assets, is assumed to be invertible. Solving (19.4)
using (19.9) gives

$$c = \frac{\lambda_1}{2}\Sigma^{-1}(\mu - r 1_p). \tag{19.10}$$

This equation may be solved for $\lambda_1$ by plugging it into the condition $\mu^\top c = \bar{\mu}$.
This is the mean-variance efficient weight vector of the risky assets if a riskless asset
exists. The final solution is:

$$c = \frac{\bar{\mu}\, \Sigma^{-1}(\mu - r 1_p)}{\mu^\top \Sigma^{-1}(\mu - r 1_p)}. \tag{19.11}$$
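The formulas (19.10) and (19.11) can be checked numerically. The following is a minimal sketch for two risky assets; the mean vector, covariance matrix, riskless return and target mean are hypothetical values chosen only for illustration, and the helper names are not part of the book's quantlets.

```python
def solve_2x2(a, rhs):
    """Solve the 2x2 linear system a x = rhs with Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [
        (rhs[0] * a[1][1] - a[0][1] * rhs[1]) / det,
        (a[0][0] * rhs[1] - a[1][0] * rhs[0]) / det,
    ]


def risky_asset_weights(mu, sigma, r, mu_bar):
    """Weights of the risky assets following (19.11)."""
    excess = [m - r for m in mu]              # mu - r 1_p
    direction = solve_2x2(sigma, excess)      # Sigma^{-1} (mu - r 1_p)
    denom = sum(m * d for m, d in zip(mu, direction))
    return [mu_bar * d / denom for d in direction]


# Hypothetical inputs, not taken from the text.
mu = [0.05, 0.08]                       # mean returns of two risky assets
sigma = [[0.04, 0.01], [0.01, 0.09]]    # covariance matrix of the risky assets
r = 0.02                                # return of the riskless asset
mu_bar = 0.06                           # pre-specified mean return

c = risky_asset_weights(mu, sigma, r, mu_bar)
c0 = 1.0 - sum(c)                       # proportion in the riskless asset
achieved = sum(m * w for m, w in zip(mu, c))

print("risky weights:", c)
print("riskless share:", c0)
print("mu'c:", achieved)
```

By construction the weights are proportional to $\Sigma^{-1}(\mu - r 1_p)$ and satisfy the condition $\mu^\top c = \bar{\mu}$ used in the text.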


The variance optimal weighting of the assets in the portfolio depends on the
structure of the covariance matrix as the following corollaries show.

Corollary 19.1 A portfolio of uncorrelated assets whose returns have equal
variances ($\Sigma = \sigma^2 I_p$) needs to be weighted equally:

$$c_{opt} = p^{-1} 1_p.$$

Proof Here we obtain $1_p^\top \Sigma^{-1} 1_p = \sigma^{-2} 1_p^\top 1_p = \sigma^{-2} p$ and therefore
$c = \frac{\sigma^{-2} 1_p}{\sigma^{-2} p} = p^{-1} 1_p$. □

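Corollary 19.2 A portfolio of correlated assets whose returns have equal
variances, i.e.

$$\Sigma = \sigma^2 \begin{pmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{pmatrix}, \qquad -\frac{1}{p-1} < \rho < 1,$$

needs to be weighted equally:

$$c_{opt} = p^{-1} 1_p.$$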


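Proof $\Sigma$ can be rewritten as $\Sigma = \sigma^2 \{(1 - \rho) I_p + \rho 1_p 1_p^\top\}$. The inverse is

$$\Sigma^{-1} = \frac{I_p}{\sigma^2(1 - \rho)} - \frac{\rho 1_p 1_p^\top}{\sigma^2(1 - \rho)\{1 + (p - 1)\rho\}}$$

since for a $(p \times p)$ matrix $A$ of the form $A = (a - b) I_p + b 1_p 1_p^\top$ the inverse is
generally given by

$$A^{-1} = \frac{I_p}{(a - b)} - \frac{b 1_p 1_p^\top}{(a - b)\{a + (p - 1)b\}}.$$

Hence

$$\Sigma^{-1} 1_p = \frac{1_p}{\sigma^2(1 - \rho)} - \frac{\rho p 1_p}{\sigma^2(1 - \rho)\{1 + (p - 1)\rho\}}
= \frac{[\{1 + (p - 1)\rho\} - \rho p] 1_p}{\sigma^2(1 - \rho)\{1 + (p - 1)\rho\}}
= \frac{\{1 - \rho\} 1_p}{\sigma^2(1 - \rho)\{1 + (p - 1)\rho\}}
= \frac{1_p}{\sigma^2\{1 + (p - 1)\rho\}}$$

which yields

$$1_p^\top \Sigma^{-1} 1_p = \frac{p}{\sigma^2\{1 + (p - 1)\rho\}}$$

and thus $c = p^{-1} 1_p$. □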
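The closed-form inverse used in this proof can be verified numerically. The sketch below builds an equicorrelation covariance matrix for hypothetical values of $p$, $\sigma^2$ and $\rho$, multiplies it by the closed-form inverse, and derives the weights; all names and numbers are illustrative assumptions.

```python
def closed_form_inverse(a, b, p):
    """Inverse of A = (a - b) I_p + b 1_p 1_p^T from the closed form."""
    diag_term = 1.0 / (a - b)
    ones_term = b / ((a - b) * (a + (p - 1) * b))
    return [[(diag_term if i == j else 0.0) - ones_term for j in range(p)]
            for i in range(p)]


def matmul(x, y):
    """Plain matrix product of two square lists of lists."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Hypothetical values: four assets, variance 0.04, correlation 0.3.
p, sigma2, rho = 4, 0.04, 0.3
a, b = sigma2, sigma2 * rho              # diagonal and off-diagonal entries
cov = [[a if i == j else b for j in range(p)] for i in range(p)]

inv = closed_form_inverse(a, b, p)
product = matmul(cov, inv)               # should be close to the identity

row_sums = [sum(row) for row in inv]     # Sigma^{-1} 1_p
weights = [s / sum(row_sums) for s in row_sums]
print(weights)
```

The product of the covariance matrix and the closed-form inverse is the identity up to rounding, and the resulting weights are all equal to $1/p$.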

Let us now consider assets with different variances. We will see that in this case
the weights are adjusted to the risk.

Corollary 19.3 A portfolio of uncorrelated assets with returns of different
variances, i.e. $\Sigma = \operatorname{diag}(\sigma_1^2, \dots, \sigma_p^2)$, has the following optimal weights

$$c_{j,opt} = \frac{\sigma_j^{-2}}{\sum_{l=1}^{p} \sigma_l^{-2}}, \qquad j = 1, \dots, p.$$

Proof From $\Sigma^{-1} = \operatorname{diag}(\sigma_1^{-2}, \dots, \sigma_p^{-2})$ we have $1_p^\top \Sigma^{-1} 1_p = \sum_{l=1}^{p} \sigma_l^{-2}$ and
therefore the optimal weights are $c_j = \sigma_j^{-2} / \sum_{l=1}^{p} \sigma_l^{-2}$. □
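Corollary 19.3 amounts to weighting each asset by its inverse variance. A minimal sketch, with two hypothetical variances chosen for illustration:

```python
def inverse_variance_weights(variances):
    """Optimal weights for uncorrelated assets: proportional to 1 / sigma_j^2."""
    precisions = [1.0 / v for v in variances]
    total = sum(precisions)
    return [q / total for q in precisions]


# Hypothetical variances of two uncorrelated assets.
variances = [0.01, 0.04]
weights = inverse_variance_weights(variances)

# Variance of the resulting portfolio, sum_j c_j^2 sigma_j^2.
portfolio_variance = sum(w * w * v for w, v in zip(weights, variances))
print(weights, portfolio_variance)
```

The asset with a quarter of the variance receives four times the weight, and the portfolio variance (0.008 here) is below the smaller of the two individual variances.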

This result can be generalised for covariance matrices with block structures.

Corollary 19.4 A portfolio of assets with returns $X \sim (\mu, \Sigma)$, where the
covariance matrix has the form:

$$\Sigma = \begin{pmatrix} \Sigma_1 & 0 & \cdots & 0 \\ 0 & \Sigma_2 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & \Sigma_r \end{pmatrix}$$

has optimal weights $c = (c_1, \dots, c_r)^\top$ given by

$$c_{j,opt} = \frac{\Sigma_j^{-1} 1}{1^\top \Sigma_j^{-1} 1}, \qquad j = 1, \dots, r.$$

Summary

An efficient portfolio is one that keeps the risk minimal under the
constraint that a given mean return is achieved and that the weights
sum to 1, i.e. that minimises
$\mathcal{L} = c^\top \Sigma c + \lambda_1(\bar{\mu} - c^\top \mu) + \lambda_2(1 - c^\top 1_p)$.

If a riskless asset does not exist, the variance efficient portfolio
weights are given by

$$c = \frac{\Sigma^{-1} 1_p}{1_p^\top \Sigma^{-1} 1_p}.$$

If a riskless asset exists, the mean-variance efficient portfolio
weights are given by

$$c = \frac{\bar{\mu}\, \Sigma^{-1}(\mu - r 1_p)}{\mu^\top \Sigma^{-1}(\mu - r 1_p)}.$$

The efficient weighting depends on the structure of the covariance
matrix $\Sigma$. Equal variances of the assets in the portfolio lead to
equal weights, different variances lead to weightings proportional
to the inverses of these variances:

$$c_{j,opt} = \frac{\sigma_j^{-2}}{\sum_{l=1}^{p} \sigma_l^{-2}}.$$


19.3 Efficient Portfolios in Practice

We  can  now  demonstrate  the  usefulness  of  this  technique  by  applying  our  method  to 
the  monthly  market  returns  computed  on  the  basis  of  transactions  at  the  New  York 
stock  market  and  the  NASDAQ  stock  market  between  January  2000  and  December 
2009. 

Example 19.2 Recall that we had shown the portfolio returns with uniform and
optimal weights in Fig. 19.2. The covariance matrix of the returns of IBM and
Forward Industries is

$$S = \begin{pmatrix} 0.0073 & 0.0023 \\ 0.0023 & 0.0454 \end{pmatrix}.$$

Hence by (19.7) the optimal weighting is

$$\hat{c} = \frac{S^{-1} 1_2}{1_2^\top S^{-1} 1_2} = (0.8952, 0.1048)^\top.$$
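The two-asset weighting can be recomputed directly from the printed covariance matrix. The sketch below uses the explicit inverse of a 2x2 matrix; it gives roughly (0.896, 0.104), slightly different from the reported (0.8952, 0.1048), presumably because the printed covariance entries are rounded.

```python
# Covariance matrix of IBM and Forward Industries as printed in Example 19.2.
S = [[0.0073, 0.0023],
     [0.0023, 0.0454]]

# Explicit inverse of a 2x2 matrix.
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
S_inv = [[S[1][1] / det, -S[0][1] / det],
         [-S[1][0] / det, S[0][0] / det]]

# Formula (19.7): c = S^{-1} 1_2 / (1_2' S^{-1} 1_2).
s_inv_ones = [sum(row) for row in S_inv]
weights = [v / sum(s_inv_ones) for v in s_inv_ones]
print([round(w, 4) for w in weights])
```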

The  effect  of  efficient  weighting  becomes  even  clearer  when  we  expand  the 
portfolio  to  six  assets.  The  covariance  matrix  for  the  returns  of  all  six  firms 
introduced  in  Example  19.1  is 


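$$S = 10^{-3} \begin{pmatrix}
7.3 & 6.2 & 3.1 & 2.3 & -0.1 & 5.2 \\
6.2 & 23.9 & 4.3 & 2.1 & 0.4 & 6.4 \\
3.1 & 4.3 & 19.5 & -0.9 & 1.1 & 3.7 \\
2.3 & 2.1 & -0.9 & 45.4 & -2.1 & 0.8 \\
-0.1 & 0.4 & 1.1 & -2.1 & 2.4 & -0.1 \\
5.2 & 6.4 & 3.7 & 0.8 & -0.1 & 14.7
\end{pmatrix}.$$

Hence the optimal weighting is

$$\hat{c} = \frac{S^{-1} 1_6}{1_6^\top S^{-1} 1_6} = (0.1894, -0.0139, 0.0094, 0.0580, 0.7112, 0.0458)^\top.$$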


As  we  can  clearly  see,  the  optimal  weights  are  quite  different  from  the  equal 
weights  (cj  =  1/6).  The  weights  which  were  used  are  shown  in  text  windows  on 
the  right  hand  side  of  Fig.  19.3. 
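The six-asset weighting can be recomputed in the same way from the printed covariance matrix. The following sketch solves $S x = 1_6$ by Gaussian elimination instead of forming the inverse; the solver and helper names are illustrative, and because the printed entries are rounded the result should only roughly match the reported weights.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                     # back substitution
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def min_variance_weights(cov):
    """Formula (19.8): c = S^{-1} 1_p / (1_p' S^{-1} 1_p)."""
    s_inv_ones = solve(cov, [1.0] * len(cov))
    total = sum(s_inv_ones)
    return [v / total for v in s_inv_ones]


def portfolio_variance(cov, c):
    n = len(c)
    return sum(c[i] * cov[i][j] * c[j] for i in range(n) for j in range(n))


names = ["IBM", "Apple", "BAC", "Ford", "Edison", "Stanley"]
S = [[1e-3 * v for v in row] for row in [
    [7.3, 6.2, 3.1, 2.3, -0.1, 5.2],
    [6.2, 23.9, 4.3, 2.1, 0.4, 6.4],
    [3.1, 4.3, 19.5, -0.9, 1.1, 3.7],
    [2.3, 2.1, -0.9, 45.4, -2.1, 0.8],
    [-0.1, 0.4, 1.1, -2.1, 2.4, -0.1],
    [5.2, 6.4, 3.7, 0.8, -0.1, 14.7],
]]

weights = min_variance_weights(S)
equal = [1.0 / 6.0] * 6
var_opt = portfolio_variance(S, weights)
var_equal = portfolio_variance(S, equal)

for name, w in zip(names, weights):
    print(f"{name:8s} {w: .4f}")
print("variance (optimal):", var_opt, " variance (equal):", var_equal)
```

Comparing the two printed variances shows how much risk the efficient weighting removes relative to the equally weighted portfolio.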

This  efficient  weighting  assumes  stable  covariances  between  the  assets  over  time. 
Changing  covariance  structure  over  time  implies  weights  that  depend  on  time  as 
well.  This  is  part  of  a  large  body  of  literature  on  multivariate  volatility  models. 
For  a  review  refer  to  Franke  et  al.  (2011). 
Fig. 19.3 Portfolio of all six assets, equal and efficient weights Q MVAportfol

[Figure panels, horizontal axis X from 0 to 120. Equally Weighted Portfolio,
weights: 0.167 IBM, 0.167 Apple, 0.167 BAC, 0.167 Ford, 0.167 Edison,
0.167 Stanley. Optimal Weighted Portfolio, weights: 0.189 IBM, -0.014 Apple,
0.009 BAC, 0.058 Ford, 0.711 Edison, 0.046 Stanley.]

Summary

Efficient portfolio weighting in practice consists of estimating the
covariances of the assets in the portfolio and then computing
efficient weights from this empirical covariance matrix.

Note that this efficient weighting assumes stable covariances
between the assets over time.


19.4  The  Capital  Asset  Pricing  Model 

The CAPM considers the relation between a mean-variance efficient portfolio and
an asset uncorrelated with this portfolio. Let us denote this specific asset return by
$y_0$. The riskless asset with constant return $y_0 = r$ may be such an asset. Recall
from (19.4) the condition for a mean-variance efficient portfolio:

$$2\Sigma c - \lambda_1 \mu - \lambda_2 1_p = 0.$$

In order to eliminate $\lambda_2$, we can multiply (19.4) by $c^\top$ to get:

$$2 c^\top \Sigma c - \lambda_1 \bar{\mu} = \lambda_2.$$

Plugging this into (19.4), we obtain:

$$2\Sigma c - \lambda_1 \mu = 2 c^\top \Sigma c\, 1_p - \lambda_1 \bar{\mu} 1_p$$

$$\mu = \bar{\mu} 1_p + \frac{2}{\lambda_1}(\Sigma c - c^\top \Sigma c\, 1_p). \tag{19.12}$$

A 1 

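For the asset that is uncorrelated with the portfolio, Eq. (19.12) can be written as:

$$y_0 = \bar{\mu} - \frac{2}{\lambda_1} c^\top \Sigma c$$

since $y_0 = r$ is the mean return of this asset and is otherwise uncorrelated with the
risky assets. This yields:

$$\lambda_1 = 2\, \frac{c^\top \Sigma c}{\bar{\mu} - y_0} \tag{19.13}$$

and if (19.13) is plugged into (19.12):

$$\mu = \bar{\mu} 1_p + \frac{\bar{\mu} - y_0}{c^\top \Sigma c}(\Sigma c - c^\top \Sigma c\, 1_p)$$

$$\mu = y_0 1_p + \frac{\Sigma c}{c^\top \Sigma c}(\bar{\mu} - y_0)$$

$$\mu = y_0 1_p + \beta(\bar{\mu} - y_0) \tag{19.14}$$

with

$$\beta \stackrel{\mathrm{def}}{=} \frac{\Sigma c}{c^\top \Sigma c}.$$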

The relation (19.14) holds if there exists any asset that is uncorrelated with
the mean-variance efficient portfolio $c$. The existence of a riskless asset is not a
necessary condition for deriving (19.14). However, for this special case we arrive at
the well-known expression

$$\mu = r 1_p + \beta(\bar{\mu} - r), \tag{19.15}$$

which is known as the CAPM, see Franke et al. (2011). The beta factor $\beta$ measures
the relative performance with respect to riskless assets or an index. It reflects the
sensitivity of an asset with respect to the whole market. The beta factor is close to
1 for most assets. A factor of 1.16, for example, means that the asset reacts in relation
to movements of the whole market (expressed through an index like DAX or DOW
JONES) 16% more strongly than the index. This is of course true for both positive and
negative fluctuations of the whole market.
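The definition $\beta = \Sigma c / (c^\top \Sigma c)$ is easy to evaluate for a given weight vector. The sketch below uses the two-asset covariance matrix printed in Example 19.2 and, purely as an illustrative stand-in for the efficient portfolio or a market index, the equally weighted portfolio; that choice of $c$ is an assumption, not the book's.

```python
def beta_factors(cov, c):
    """Beta vector Sigma c / (c' Sigma c) relative to the portfolio c."""
    n = len(c)
    sigma_c = [sum(cov[i][j] * c[j] for j in range(n)) for i in range(n)]
    port_var = sum(ci * si for ci, si in zip(c, sigma_c))
    return [s / port_var for s in sigma_c]


# Covariance of IBM and Forward Industries as printed in Example 19.2.
cov = [[0.0073, 0.0023],
       [0.0023, 0.0454]]
c = [0.5, 0.5]          # reference portfolio, assumed for illustration only

betas = beta_factors(cov, c)
portfolio_beta = sum(ci * bi for ci, bi in zip(c, betas))
print(betas, portfolio_beta)
```

The less volatile asset has a beta below 1 and the more volatile one a beta above 1, while the weighted average of the betas over the reference portfolio is exactly 1.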


Summary

The weights of the mean-variance efficient portfolio satisfy
$2\Sigma c - \lambda_1 \mu - \lambda_2 1_p = 0$.

In the CAPM the mean of $X$ depends on the riskless asset and the
pre-specified mean $\bar{\mu}$ as follows: $\mu = r 1_p + \beta(\bar{\mu} - r)$.

The beta factor $\beta$ measures the relative performance with respect
to riskless assets or an index and reflects the sensitivity of an asset
with respect to the whole market.


19.5  Exercises 


Exercise 19.1 Prove that the inverse of $A = (a - b) I_p + b 1_p 1_p^\top$ is given by

$$A^{-1} = \frac{I_p}{(a - b)} - \frac{b 1_p 1_p^\top}{(a - b)\{a + (p - 1)b\}}.$$


Exercise  19.2  The  empirical  covariance  between  the  120  returns  of  IBM  and 
Forward  Industries  is  0.0023  (see  Example  19.2).  Test  if  the  true  covariance  is  zero. 
Hint:  Use  Fisher's  Z -transform. 

Exercise 19.3 Explain why in both Figs. 19.2 and 19.3 the portfolios have negative
returns just before the end of the series, regardless of whether they are optimally
weighted or not! (What happened in mid 2007?)

Exercise 19.4 Apply the method used in Example 19.2 on the same data
(Table 22.5) including also the Digital Equipment company. Obviously one of
the weights is negative. Is this an efficient weighting?

Exercise 19.5 In the CAPM the $\beta$ value tells us about the performance of the
portfolio relative to the riskless asset. Calculate the $\beta$ value for each single stock
price series relative to the “riskless” asset IBM.


Chapter  20 

Computationally  Intensive  Techniques 


It is generally accepted that training in statistics must include some exposure to the
mechanics of computational statistics. This exposure to computational methods is
essential when we consider extremely high-dimensional data. Computer-aided
techniques can help us to discover dependencies in high dimensions without
complicated mathematical tools. A draftsman's plot (i.e. a matrix of pairwise
scatterplots like in Fig. 1.14) may lead us immediately to a theoretical hypothesis
(on a lower dimensional space) on the relationship of the variables. Computer-aided
techniques are therefore at the heart of multivariate statistical analysis.

With the rapidly increasing amount of data, statistics faces a new challenge.
While in the twentieth century the focus was on the mathematical precision of
statistical modelling, the twenty-first century relies more and more on data analytic
procedures that provide information (even for extremely large data bases) at one's
fingertips. This demand for fast availability of condensed statistical information has
changed the statistical paradigm and has shifted energy from mathematical analysis
to computational analysis, of course without losing sight of the statistical core
questions.

In this chapter we first present the concept of Simplicial Depth, a multivariate
extension of the data depth concept of Sect. 1.1. We then present Projection
Pursuit, a semiparametric technique which is based on a one-dimensional, flexible
regression or on the idea of density smoothing applied to principal component
analysis (PCA) type projections. A similar model underlies the Sliced Inverse
Regression (SIR) technique which we discuss in Sect. 20.3.

The next technique is called support vector machines (SVMs) and is motivated
by non-linear classification (discrimination) problems. SVMs are classification
methods based on statistical learning theory. A quadratic optimisation problem
determines so-called support vectors with high margin that guarantee maximal
separability. Non-linear classification is achieved by mapping the data into a feature
space and finding a linear separating hyperplane in this feature space. Another
advanced technique is CART (Classification and Regression Trees), a decision tree
procedure developed by Breiman, Friedman, Olshen, and Stone (1984).


20.1  Simplicial  Depth 

Simplicial depth generalises the notion of data depth as introduced in Sect. 1.1.
This general definition allows us to define a multivariate median and to visually
present high-dimensional data in low dimension. For univariate data we have
well-known parameters of location which describe the centre of a distribution of a
random variable $X$. These parameters are for example the mean

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \tag{20.1}$$

or the mode

$$\hat{x}_{mod} = \arg\max_{x} \hat{f}(x),$$

where $\hat{f}$ is the estimated density function of $X$ (see Sect. 1.3). The median

$$\hat{x}_{med} = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ odd} \\ \dfrac{x_{(n/2)} + x_{(n/2+1)}}{2} & \text{otherwise,} \end{cases}$$

where $x_{(i)}$ is the order statistics of the $n$ observations $x_i$, is yet another measure of
location.

The first two parameters can be easily extended to multivariate random variables.
The mean in higher dimensions is defined as in (20.1) and the mode accordingly,

$$\hat{x}_{mod} = \arg\max_{x} \hat{f}(x)$$

with $\hat{f}$ the estimated multidimensional density function of $X$ (see Sect. 1.3). The
median poses a problem though since in a multivariate sense we cannot interpret the
element-wise median

$$\hat{x}_{med,j} = \begin{cases} x_{((n+1)/2),j} & \text{if } n \text{ odd} \\ \dfrac{x_{(n/2),j} + x_{(n/2+1),j}}{2} & \text{otherwise} \end{cases} \tag{20.2}$$

as a point that is “most central”. The same argument applies to other observations
of a sample that have a certain “depth” as defined in Sect. 1.1. The “fourths” or the
“extremes” are not defined in a straightforward way in higher dimensions (not even
in two dimensions).

An equivalent definition of the median in one dimension is given by the simplicial
depth. It is defined as follows: For each pair of datapoints $x_k$ and $x_l$ we generate a
closed interval, a one-dimensional simplex, which contains $x_k$ and $x_l$ as border
points. Redefine the median as the datapoint $x_{med}$ which is enclosed in the
maximum number of intervals:

$$x_{med} = \arg\max_{i} \#\{k, l;\ x_i \in [x_k, x_l]\}. \tag{20.3}$$

With this definition of the median, the median is the “deepest” and “most
central” point in a data set as discussed in Sect. 1.1. This definition involves a
computationally intensive operation since we generate $n(n - 1)/2$ intervals for $n$
observations.
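Definition (20.3) translates directly into a brute-force count over all pairs of observations. The sketch below uses five hypothetical data values; the function name is illustrative.

```python
from itertools import combinations


def simplicial_depth_1d(data):
    """For each observation, count the closed intervals [x_k, x_l] containing it."""
    depths = [0] * len(data)
    for k, l in combinations(range(len(data)), 2):
        lo, hi = min(data[k], data[l]), max(data[k], data[l])
        for i, x in enumerate(data):
            if lo <= x <= hi:
                depths[i] += 1
    return depths


# Hypothetical sample of five observations.
data = [1.0, 2.0, 4.0, 7.0, 11.0]
depths = simplicial_depth_1d(data)
deepest = data[depths.index(max(depths))]
print(depths, deepest)   # depths [4, 7, 8, 7, 4], deepest point 4.0
```

The deepest observation, 4.0, coincides with the usual median of the sample, and the double loop makes the $n(n-1)/2$ intervals explicit.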

In two dimensions, the computation is even more intensive since the interval
$[x_k, x_l]$ is replaced by a triangle constructed from three different datapoints. The
median as the deepest point is then defined by that datapoint that is covered by
the maximum number of triangles. In three dimensions triangles become pyramids
formed from 4 points and the median is that datapoint that lies in the maximum
number of pyramids.

An example for the depth in two dimensions is given by the constellation of
points given in Fig. 20.1. If we build for example the triangle of the points 1, 3, 5
(denoted as △135 in Table 20.1), it contains the point 4. From Table 20.1 we count
the number of coverages to obtain the simplicial depth values of Table 20.2.


[Figure: Simplicial Depth Example]

Fig. 20.1 Construction of simplicial depth Q MVAsimdepl
Table 20.1 Coverages for artificial configuration of points

No.   Triangle   Coverages
 1    △123       1 2 3
 2    △124       1 2 4
 3    △125       1 2 5
 4    △126       1 2 3 4 6
 5    △134       1 3 4
 6    △135       1 3 4 5
 7    △136       1 3 6
 8    △145       1 4 5
 9    △146       1 3 4 6
10    △156       1 3 4 5 6
11    △234       2 3 4
12    △235       2 3 4 5
13    △236       2 3 4 6
14    △245       2 4 5
15    △246       2 4 6
16    △256       2 5 6
17    △345       3 4 5
18    △346       3 4 6
19    △356       3 5 6
20    △456       4 5 6
Table 20.2 Simplicial depths for artificial configuration of points

Point   1    2    3    4    5    6
Depth   10   10   12   14   8    8

In arbitrary dimension $p$, we look for datapoints that lie inside a simplex (or
convex hull) formed from $p + 1$ points. We therefore extend the definition of the
median to the multivariate case as follows

$$x_{med} = \arg\max_{i} \#\{k_0, \dots, k_p;\ x_i \in \operatorname{hull}(x_{k_0}, \dots, x_{k_p})\}. \tag{20.4}$$

Here $k_0, \dots, k_p$ denote the indices of $p + 1$ datapoints. Thus for each datapoint
we have a multivariate data depth. If we compute all the necessary simplices
$\operatorname{hull}(x_{k_0}, \dots, x_{k_p})$, the computing time will unfortunately be exponential as the
dimension increases.
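For $p = 2$ the definition (20.4) can be sketched as a brute-force count over all triangles. The example below uses five hypothetical points (the corners of a square and one interior point), not the configuration of Fig. 20.1, and assumes that no three points are collinear; the closed triangle test is based on the signs of three cross products.

```python
from itertools import combinations


def orient(a, b, p):
    """Cross product (b - a) x (p - a); its sign tells on which side of ab p lies."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def in_triangle(p, a, b, c):
    """True if p lies in the closed triangle abc (vertices and edges included)."""
    d = (orient(a, b, p), orient(b, c, p), orient(c, a, p))
    return not (min(d) < 0 and max(d) > 0)


def simplicial_depth_2d(points):
    """For each point, count the triangles of data points that cover it."""
    depths = [0] * len(points)
    for i, j, k in combinations(range(len(points)), 3):
        for idx, p in enumerate(points):
            if in_triangle(p, points[i], points[j], points[k]):
                depths[idx] += 1
    return depths


# Hypothetical configuration: four corners of a square and one interior point.
points = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 1)]
depths = simplicial_depth_2d(points)
median_point = points[depths.index(max(depths))]
print(depths, median_point)   # depths [6, 6, 6, 6, 8], deepest point (2, 1)
```

Each corner is covered only by the six triangles it is a vertex of, while the interior point is additionally covered by two corner triangles, which makes it the two-dimensional median of this configuration.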

In  Fig.  20.2  we  calculate  the  simplicial  depth  for  a  two-dimensional,  10  point 
distribution  according  to  depth.  It  contains  100  data  points  with  corresponding 
parameters  controlling  its  spread.  The  deepest  point,  the  two-dimensional  median, 
is  indicated  as  a  big  star  in  the  centre.  The  points  with  less  depth  are  indicated  via 
grey  shades. 
Fig. 20.2 10 point distribution according to depth with the median shown
as a big star in the centre Q MVAsimdepex

[Figure: Simplicial depth, horizontal axis x]


Summary

The “depth” of a datapoint in one dimension can be computed by
counting all (closed) intervals of two datapoints which contain the
datapoint.

The “deepest” datapoint is the central point of the distribution, the
median.

The “depth” of a datapoint in arbitrary dimension $p$ is defined as
the number of simplices (constructed from $p + 1$ points) covering
this point. It is called simplicial depth.

A multivariate extension of the median is to take the “deepest”
datapoint of the distribution.

In the bivariate case we count all triangles of datapoints which
contain the datapoint to compute its depth.


20.2  Projection  Pursuit 

“Projection Pursuit” stands for a class of exploratory projection techniques.
This class contains statistical methods designed for analysing high-dimensional
data using low-dimensional projections. The aim of projection pursuit is to
reveal possible non-linear and therefore interesting structures hidden in the
high-dimensional data. To what extent these structures are “interesting” is measured by
an index. Exploratory Projection Pursuit (EPP) goes back to Kruskal (1969, 1972).
The approach was successfully implemented for exploratory purposes by various
other authors. The idea has been applied to regression analysis, density estimation,
classification and discriminant analysis.


Exploratory  Projection  Pursuit 

In EPP, we try to find “interesting” low-dimensional projections of the data. For
this purpose, a suitable index function $I(\alpha)$, depending on a normalised projection
vector $\alpha$, is used. This function will be defined such that “interesting” views
correspond to local and global maxima of the function. This approach naturally
accompanies the technique of PCA of the covariance structure of a random vector $X$.
In PCA we are interested in finding the axes of the covariance ellipsoid. The index
function $I(\alpha)$ is in this case the variance of a linear combination $\alpha^\top X$ subject to the
normalising constraint $\alpha^\top \alpha = 1$ (see Theorem 11.2). If we analyse a sample with a
$p$-dimensional normal distribution, the “interesting” high-dimensional structure we
find by maximising this index is of course linear.

There are many possible projection indices; for simplicity only the kernel based and
polynomial based indices are reported. Assume that the $p$-dimensional random
variable $X$ is sphered and centred, that is, $E(X) = 0$ and $\operatorname{Var}(X) = I_p$. This
will remove the effect of location, scale, and correlation structure. This covariance
structure can be achieved easily by the Mahalanobis transformation (3.26).

Friedman and Tukey (1974) proposed to investigate the high-dimensional
distribution of $X$ by considering the index

$$I_{FT,h}(\alpha) = n^{-1} \sum_{i=1}^{n} \hat{f}_{h,\alpha}(\alpha^\top X_i) \tag{20.5}$$

where $\hat{f}_{h,\alpha}$ denotes the kernel estimator (see Sect. 1.3)

$$\hat{f}_{h,\alpha}(z) = n^{-1} \sum_{j=1}^{n} K_h(z - \alpha^\top X_j) \tag{20.6}$$

of the projected data. Note that (20.5) is an estimate of $\int f^2(z)\,dz$ where $z = \alpha^\top X$
is a one-dimensional random variable with mean zero and unit variance. If the
high-dimensional distribution of $X$ is normal, then each projection $z = \alpha^\top X$ is standard
normal since $\|\alpha\| = 1$ and since $X$ has been centred and sphered by, e.g. the
Mahalanobis transformation.
The index should therefore be stable as a function of $\alpha$ if the high-dimensional
data is in fact normal. Changes in $I_{FT,h}(\alpha)$ with respect to $\alpha$ therefore indicate
deviations from normality. Hodges and Lehman (1956) showed that, given a mean
of zero and unit variance, the (compact support) density which minimises $\int f^2$ is
uniquely given by

$$f(z) = \max\{0, c(b^2 - z^2)\},$$

where $c = 3/(20\sqrt{5})$ and $b = \sqrt{5}$. This is a parabolic density function, which is
equal to zero outside the interval $(-\sqrt{5}, \sqrt{5})$. A high value of the Friedman-Tukey
index indicates a larger departure from the parabolic form.
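The index (20.5) with the kernel estimator (20.6) can be evaluated for a given direction as sketched below. The sketch assumes a Gaussian kernel and a hand-picked bandwidth $h$, neither of which is prescribed by the text, and it does not sphere the data itself; the data points are hypothetical.

```python
import math


def gaussian_kernel(u, h):
    """Scaled Gaussian kernel K_h(u) = K(u / h) / h."""
    return math.exp(-0.5 * (u / h) ** 2) / (h * math.sqrt(2.0 * math.pi))


def friedman_tukey_index(data, alpha, h):
    """Average of the kernel density estimate at the projected observations."""
    norm = math.sqrt(sum(a * a for a in alpha))        # enforce ||alpha|| = 1
    z = [sum(a * x for a, x in zip(alpha, row)) / norm for row in data]
    n = len(z)
    density_at = [sum(gaussian_kernel(zi - zj, h) for zj in z) / n for zi in z]
    return sum(density_at) / n


# Hypothetical two-dimensional data; in practice the data would first be
# centred and sphered, e.g. by the Mahalanobis transformation.
data = [(-1.5, 0.2), (-1.3, -0.4), (-1.1, 0.9), (1.2, -0.8), (1.4, 0.1), (1.3, 0.0)]
h = 0.5                                                # assumed bandwidth

index_x = friedman_tukey_index(data, (1.0, 0.0), h)    # projection on first axis
index_y = friedman_tukey_index(data, (0.0, 1.0), h)    # projection on second axis
index_x_reversed = friedman_tukey_index(data, (-1.0, 0.0), h)
single_point = friedman_tukey_index([(0.3, -0.2)], (1.0, 0.0), h)
print(index_x, index_y)
```

Because the kernel is symmetric, reversing the sign of $\alpha$ leaves the index unchanged, so only the direction of the projection matters.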

An alternative index is based on the negative of the entropy measure, i.e.
$\int -f \log f$. The density for zero mean and unit variance which minimises the index

$$\int f \log f$$

is the standard normal density, a far more plausible candidate than the parabolic
density as a norm from which departure is to be regarded as “interesting”. Thus
in using $\int f \log f$ as a projection index we are really implementing the viewpoint
of seeing “interesting” projections as departures from normality. Yet another index
could be based on the Fisher information (see Sect. 6.2)

$$\int (f')^2 / f.$$

To optimise the entropy index, it is necessary to recalculate it at each step of the
numerical procedure. There is no method of obtaining the index via summary
statistics of the multivariate data set, so the workload of the calculation at each
iteration is determined by the number of observations. It is therefore interesting to
look for approximations to the entropy index. Jones and Sibson (1987) suggested
that deviations from the normal density should be considered as

$$f(x) = \varphi(x)\{1 + \varepsilon(x)\} \tag{20.7}$$

where the function $\varepsilon$ satisfies

$$\int \varphi(u)\varepsilon(u)u^{r}\,du = 0, \quad \text{for } r = 0, 1, 2. \tag{20.8}$$

These conditions ensure that $f$ integrates to one and has mean zero and unit variance.

In order to develop the Jones and Sibson (1987) index it is convenient to think in
terms of cumulants $\kappa_3 = \mu_3 = E(X^3)$, $\kappa_4 = \mu_4 - 3 = E(X^4) - 3$ for a variable $X$
with mean zero and unit variance (see Sect. 1.3). The
standard normal density satisfies $\kappa_3 = \kappa_4 = 0$. An index with any hope of tracking
the entropy index must at least incorporate information up to the level of symmetric
departures ($\kappa_3$ or $\kappa_4$ not zero) from normality. The simplest of such indices is a
positive definite quadratic form in $\kappa_3$ and $\kappa_4$. It must be invariant under sign-reversal
of the data since both $\alpha^\top X$ and $-\alpha^\top X$ should show the same kind of departure from
normality. Note that $\kappa_3$ is odd under sign-reversal, i.e. $\kappa_3(\alpha^\top X) = -\kappa_3(-\alpha^\top X)$.
The cumulant $\kappa_4$ is even under sign-reversal, i.e. $\kappa_4(\alpha^\top X) = \kappa_4(-\alpha^\top X)$. The
quadratic form in $\kappa_3$ and $\kappa_4$ measuring departure from normality therefore cannot
include a mixed $\kappa_3 \kappa_4$ term.

For the density (20.7) one may conclude with (20.8) that

$$\int f(u) \log f(u)\,du - \int \varphi(u) \log \varphi(u)\,du \approx \frac{1}{2} \int \varphi(u)\varepsilon^2(u)\,du.$$

Now if $f$ is expressed as a Gram-Charlier expansion

$$f(x) = \varphi(x)\{1 + \kappa_3 H_3(x)/6 + \kappa_4 H_4(x)/24 + \cdots\} \tag{20.9}$$

(Kendall & Stuart, 1977, p. 169) where $H_r$ is the $r$-th Hermite polynomial, then
the truncation of (20.9) and use of orthogonality and normalisation properties of
Hermite polynomials with respect to $\varphi$ yields

$$\frac{1}{2} \int \varphi(x)\varepsilon^2(x)\,dx = (\kappa_3^2 + \kappa_4^2/4)/12.$$

The index proposed by Jones and Sibson (1987) is therefore

$$I_{JS}(\alpha) = \{\kappa_3^2(\alpha^\top X) + \kappa_4^2(\alpha^\top X)/4\}/12.$$

This index measures in fact the negative entropy difference $\int f \log f - \int \varphi \log \varphi$.
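A sample version of the Jones-Sibson index only needs the third and fourth moments of the standardised projected data. The sketch below is one possible estimator, using moments that divide by $n$; this estimator and the small example data are assumptions made for illustration.

```python
import math


def jones_sibson_index(z):
    """Sample Jones-Sibson index of one-dimensional projected data z."""
    n = len(z)
    mean = sum(z) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in z) / n)
    s = [(v - mean) / sd for v in z]                 # zero mean, unit variance
    kappa3 = sum(v ** 3 for v in s) / n              # third cumulant
    kappa4 = sum(v ** 4 for v in s) / n - 3.0        # fourth cumulant
    return (kappa3 ** 2 + kappa4 ** 2 / 4.0) / 12.0


# Hypothetical projected data: a symmetric two-point distribution.
z = [-1.0, 1.0, -1.0, 1.0]
index = jones_sibson_index(z)
index_reversed = jones_sibson_index([-v for v in z])
print(index)   # kappa3 = 0, kappa4 = -2, so the index is 1/12
```

For this symmetric sample the third cumulant vanishes and the whole index comes from the fourth cumulant; reversing the sign of the data leaves the value unchanged, as required of the quadratic form.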

Example  20.1  The  EPP  is  used  on  the  Swiss  bank  note  data.  For  50  randomly 
chosen  one-dimensional  projections  of  this  six-dimensional  dataset  we  calculate  the 
Friedman-Tukey  index  to  evaluate  how  “interesting”  their  structures  are. 

Figure  20.3  shows  the  density  for  the  standard,  normally  distributed  data  (green) 
and  the  estimated  densities  for  the  best  (red)  and  the  worst  (blue)  projections  found. 
A  dotplot  of  the  projections  is  also  presented.  In  the  lower  part  of  the  figure  we 
see  the  estimated  value  of  the  Friedman-Tukey  index  for  each  computed  projection. 
From this information we can judge the non-normality of the bank note data set
since  there  is  a  lot  of  variation  across  the  50  random  projections. 


Fig. 20.3 Exploratory Projection Pursuit for the Swiss bank notes data (green = standard normal,
red = best, blue = worst) Q MVAppexample


Projection Pursuit Regression

The problem in projection pursuit regression is to estimate a response surface

$$f(x) = E(Y \mid x)$$

via approximating functions of the form:

$$\hat{f}(x) = \sum_{k=1}^{M} g_k(\Lambda_k^\top x)$$

with non-parametric regression functions $g_k$ and projection indices $\Lambda_k$. Given
observations $\{(x_1, y_1), \dots, (x_n, y_n)\}$ with $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$ the basic algorithm
works as follows.

1. Set $r_i^{(0)} = y_i$ and $k = 1$.

2. Minimise

$$E_k = \sum_{i=1}^{n} \{r_i^{(k-1)} - g_k(\Lambda_k^\top x_i)\}^2$$

where $\Lambda_k$ is an orthogonal projection matrix and $g_k$ is a non-parametric
regression estimator.

3. Compute new residuals

$$r_i^{(k)} = r_i^{(k-1)} - g_k(\Lambda_k^\top x_i).$$

4. Increase $k$ and repeat the last two steps until $E_k$ becomes small.

Although  this  approach  seems  to  be  simple,  we  encounter  some  problems.  One 
of  the  most  serious  is  that  the  decomposition  of  a  function  into  sums  of  functions  of 
projections  may  not  be  unique.  An  example  is 


$$z_1 z_2 = \frac{1}{4ab}\{(a z_1 + b z_2)^2 - (a z_1 - b z_2)^2\}.$$


Numerical  improvements  of  this  algorithm  were  suggested  by  Friedman  and 
Stuetzle  (1981). 
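The basic projection pursuit regression algorithm described above can be sketched in a few lines for two-dimensional inputs. The sketch makes several simplifying assumptions that are not in the text: the projection is a single unit vector chosen by grid search over angles, the non-parametric smoother $g_k$ is a regressogram (bin means), and the loop runs for a fixed number of terms instead of stopping when $E_k$ is small. It is not the refined procedure of Friedman and Stuetzle (1981).

```python
import math


def bin_smoother(z, r, n_bins):
    """Regressogram: fitted value is the mean of r within equal-width bins of z."""
    lo, hi = min(z), max(z)
    width = (hi - lo) / n_bins or 1.0
    idx = [min(int((v - lo) / width), n_bins - 1) for v in z]
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for b, ri in zip(idx, r):
        sums[b] += ri
        counts[b] += 1
    return [sums[b] / counts[b] for b in idx]


def ppr_fit(x, y, n_terms=2, n_angles=36, n_bins=5):
    residuals = list(y)                              # step 1: r_i^(0) = y_i
    history = []
    for _ in range(n_terms):
        best = None
        for a in range(n_angles):                    # step 2: search directions
            theta = math.pi * a / n_angles
            d = (math.cos(theta), math.sin(theta))
            z = [d[0] * xi[0] + d[1] * xi[1] for xi in x]
            fitted = bin_smoother(z, residuals, n_bins)
            error = sum((ri - fi) ** 2 for ri, fi in zip(residuals, fitted))
            if best is None or error < best[0]:
                best = (error, d, fitted)
        error, d, fitted = best
        residuals = [ri - fi for ri, fi in zip(residuals, fitted)]   # step 3
        history.append((d, error))                   # step 4: next term
    return history, residuals


# Hypothetical data on a grid; the response depends on x only through x1 + x2.
x = [(i / 4.0 - 1.0, j / 4.0 - 1.0) for i in range(9) for j in range(9)]
y = [(x1 + x2) ** 2 for x1, x2 in x]

history, residuals = ppr_fit(x, y)
errors = [e for _, e in history]
print(history)
```

With bin means as the smoother, each additional term can only lower the residual sum of squares, so the sequence of $E_k$ values is non-increasing.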


Summary

Exploratory Projection Pursuit is a technique used to find interesting
structures in high-dimensional data via low-dimensional
projections. Since the Gaussian distribution represents a standard
situation, we define the Gaussian distribution as the most uninteresting.

The search for interesting structures is done via a projection score
like the Friedman-Tukey index $I_{FT}(\alpha) = \int f^2$. The parabolic
distribution has the minimal score. We maximise this score over
all projections.

The Jones-Sibson index maximises

$$I_{JS}(\alpha) = \{\kappa_3^2(\alpha^\top X) + \kappa_4^2(\alpha^\top X)/4\}/12$$

as a function of $\alpha$.

The entropy index maximises

$$I_E(\alpha) = \int f(\alpha^\top X) \log f(\alpha^\top X)$$

where $f$ is the density of $\alpha^\top X$.

In Projection Pursuit Regression the idea is to represent the
unknown function by a sum of non-parametric regression functions
on projections. The key problem is in choosing the number of terms
and often the interpretability.
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20.3  Sliced  Inverse  Regression 

SIR is a dimension reduction method proposed by Duan and Li (1991). The idea is
to find a smooth regression function that operates on a variable set of projections.
Given a response variable $Y$ and a (random) vector $X \in \mathbb{R}^p$ of explanatory
variables, SIR is based on the model

$$Y = m(\beta_1^{\top} X, \ldots, \beta_k^{\top} X, \varepsilon), \tag{20.10}$$

where $\beta_1, \ldots, \beta_k$ are unknown projection vectors, $k$ is unknown and assumed to be
less than $p$, $m : \mathbb{R}^{k+1} \to \mathbb{R}$ is an unknown function, and $\varepsilon$ is the noise random
variable with $E(\varepsilon \mid X) = 0$.

Model (20.10) describes the situation where the response variable $Y$ depends
on the $p$-dimensional variable $X$ only through a $k$-dimensional subspace. The
unknown $\beta_i$'s, which span this space, are called effective dimension reduction
directions (EDR-directions). The span is denoted as effective dimension reduction
space (EDR-space). The aim is to estimate the base vectors of this space, for which
neither the length nor the direction can be identified. Only the space in which they
lie is identifiable.

SIR tries to find this $k$-dimensional subspace of $\mathbb{R}^p$ which under the
model (20.10) carries the essential information of the regression between $X$ and $Y$.
SIR also focuses on small $k$, so that nonparametric methods can be applied for the
estimation of $m$. For high dimension $p$, a direct application of nonparametric
smoothing to $X$ is generally not possible due to the sparseness of the observations.
This fact is well known as the curse of dimensionality, see Huber (1985).

The name of SIR comes from computing the inverse regression (IR) curve. That
means instead of looking for $E(Y \mid X = x)$, we investigate $E(X \mid Y = y)$, a curve
in $\mathbb{R}^p$ consisting of $p$ one-dimensional regressions. What is the connection between
the IR and the SIR model (20.10)? The answer is given in the following theorem
from Li (1991).

Theorem 20.1 Given the model (20.10) and the assumption

$$\forall b \in \mathbb{R}^p : \; E(b^{\top} X \mid \beta_1^{\top} X = \beta_1^{\top} x, \ldots, \beta_k^{\top} X = \beta_k^{\top} x) = c_0 + \sum_{i=1}^{k} c_i \beta_i^{\top} x, \tag{20.11}$$

the centred IR curve $E(X \mid Y = y) - E(X)$ lies in the linear subspace spanned by
the vectors $\Sigma \beta_i$, $i = 1, \ldots, k$, where $\Sigma = Cov(X)$.

Assumption (20.11) is equivalent to the fact that $X$ has an elliptically symmetric
distribution, see Cook and Weisberg (1991). Hall and Li (1993) have shown that
assumption (20.11) only needs to hold for the EDR-directions.

It is easy to see that for the standardised variable $Z = \Sigma^{-1/2}\{X - E(X)\}$ the
IR curve $m_1(y) = E(Z \mid Y = y)$ lies in $\mathrm{span}(\eta_1, \ldots, \eta_k)$, where $\eta_i = \Sigma^{1/2} \beta_i$.
This means that the conditional expectation $m_1(y)$ is moving in $\mathrm{span}(\eta_1, \ldots, \eta_k)$
depending on $y$. With $b$ orthogonal to $\mathrm{span}(\eta_1, \ldots, \eta_k)$ it follows that

$$b^{\top} m_1(y) = 0,$$

and further that

$$m_1(y) m_1(y)^{\top} b = Cov\{m_1(y)\} b = 0.$$

As a consequence $Cov\{E(Z \mid y)\}$ is degenerated in each direction orthogonal to all
EDR-directions $\eta_i$ of $Z$. This suggests the following algorithm.

First, estimate $Cov\{m_1(y)\}$ and then calculate the orthogonal directions of this
matrix (for example, with eigenvalue/eigenvector decomposition). In general, the
estimated covariance matrix will have full rank because of random variability,
estimation errors and numerical imprecision. Therefore, we investigate the eigenvalues
of the estimate and ignore eigenvectors having small eigenvalues. The remaining
eigenvectors $\hat{\eta}_i$ are estimates for the EDR-directions $\eta_i$ of $Z$. We can easily rescale
them to estimates $\hat{\beta}_i$ for the EDR-directions of $X$ by multiplying by $\hat{\Sigma}^{-1/2}$, but then
they are not necessarily orthogonal. SIR is strongly related to PCA. If all of the data
falls into a single interval, which means that $Cov\{m_1(y)\}$ is equal to $Cov(Z)$, SIR
coincides with PCA. Obviously, in this case any information about $y$ is ignored.

The  SIR  Algorithm 

The  algorithm  to  estimate  the  EDR-directions  via  SIR  is  as  follows: 

1. Standardise $x$:

$$z_i = \hat{\Sigma}^{-1/2}(x_i - \bar{x}).$$

2. Divide the range of $y_i$ into $S$ nonoverlapping intervals (slices) $H_s$, $s = 1, \ldots, S$.
$n_s$ denotes the number of observations within slice $H_s$, and $I_{H_s}$ the indicator
function for this slice:

$$n_s = \sum_{i=1}^{n} I_{H_s}(y_i).$$

3. Compute the mean of $z_i$ within each slice. This is a crude estimate $\hat{m}_1$ for the inverse
regression curve $m_1$:

$$\bar{z}_s = n_s^{-1} \sum_{i=1}^{n} z_i I_{H_s}(y_i).$$

4. Calculate the estimate for $Cov\{m_1(y)\}$:

$$\hat{V} = n^{-1} \sum_{s=1}^{S} n_s \bar{z}_s \bar{z}_s^{\top}.$$

5. Identify the eigenvalues $\hat{\lambda}_i$ and eigenvectors $\hat{\eta}_i$ of $\hat{V}$.

6. Transform the standardised EDR-directions $\hat{\eta}_i$ back to the original scale. Now
the estimates for the EDR-directions are given by

$$\hat{\beta}_i = \hat{\Sigma}^{-1/2} \hat{\eta}_i.$$

Remark 20.1 The number of different eigenvalues unequal to zero depends on the
number of slices. The rank of $\hat{V}$ cannot be greater than the number of slices $- 1$ (the
$z_i$ sum up to zero). This is a problem for categorical response variables, especially
for a binary response, where only one direction can be found.
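The six steps of the SIR algorithm, together with the rank restriction of Remark 20.1, can be sketched in a few lines. This is an illustration under stated assumptions rather than the book's quantlet: slices are formed with (nearly) equal numbers of observations by sorting y, and the function names `inv_sqrt` and `sir` as well as the simulated data are hypothetical.

```python
import numpy as np

def inv_sqrt(S):
    """Symmetric inverse square root of a positive definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def sir(X, y, n_slices=10):
    n, p = X.shape
    S_inv_half = inv_sqrt(np.cov(X, rowvar=False))
    Z = (X - X.mean(axis=0)) @ S_inv_half               # step 1: standardise
    slices = np.array_split(np.argsort(y), n_slices)    # step 2: slices of sorted y
    V = np.zeros((p, p))
    for idx in slices:
        z_bar = Z[idx].mean(axis=0)                     # step 3: slice means
        V += len(idx) * np.outer(z_bar, z_bar)          # step 4: weighted covariance
    V /= n
    lam, eta = np.linalg.eigh(V)                        # step 5: eigen-decomposition
    order = np.argsort(lam)[::-1]                       # largest eigenvalue first
    lam, eta = lam[order], eta[:, order]
    beta = S_inv_half @ eta                             # step 6: back to original scale
    return lam, beta

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 3))
b1 = np.array([1.0, 1.0, 1.0])
index = X @ b1
y = index + index ** 3 + rng.standard_normal(2000)

lam, beta = sir(X, y, n_slices=10)
cosine = abs(beta[:, 0] @ b1) / (np.linalg.norm(beta[:, 0]) * np.linalg.norm(b1))
lam_binary, _ = sir(X, y, n_slices=2)   # two slices: at most one non-zero eigenvalue
print(cosine, lam, lam_binary)
```

With only two slices the second and third eigenvalues vanish up to rounding error, which is the situation described in the remark for a binary response.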


SIR  II 

In the previous section we learned that it is interesting to consider the IR curve,
that is, $E(X \mid y)$. In some situations, however, SIR does not find the EDR-direction.
We overcome this difficulty by considering the conditional covariance $Cov(X \mid y)$
instead of the IR curve. An example where the EDR-directions are not found via the
IR curve is given below.

Example 20.2 Suppose that $(X_1, X_2)^{\top} \sim N(0, \mathcal{I}_2)$ and $Y = X_1^2$. Then $E(X_2 \mid y) = 0$
because of independence and $E(X_1 \mid y) = 0$ because of symmetry. Hence,
the EDR-direction $\beta = (1, 0)^{\top}$ is not found when the IR curve $E(X \mid y) = 0$ is
considered.

The conditional variance

$$Var(X_1 \mid Y = y) = E(X_1^2 \mid Y = y) = y,$$

offers an alternative way to find $\beta$. It is a function of $y$ while $Var(X_2 \mid y)$ is a
constant.
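A small simulation makes the point of Example 20.2 visible. The sketch below assumes equal-count slices and an arbitrary sample size and seed. It shows that the slice means of both components stay near zero while the slice variance of the first component grows with y.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_slices = 100_000, 5
X = rng.standard_normal((n, 2))       # (X1, X2) ~ N(0, I_2)
y = X[:, 0] ** 2                      # Y = X1^2

slices = np.array_split(np.argsort(y), n_slices)
slice_means = np.array([X[idx].mean(axis=0) for idx in slices])  # estimates of E(X | y)
slice_vars = np.array([X[idx].var(axis=0) for idx in slices])    # estimates of Var(X | y)

print(np.round(slice_means, 3))   # both columns close to 0: the IR curve is flat
print(np.round(slice_vars, 3))    # first column increases with y, second stays near 1
```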

The idea of SIR II is to consider the conditional covariances. The principle of
SIR II is the same as before: investigation of the IR curve (here the conditional
covariance instead of the conditional expectation). Unfortunately, the theory of SIR
II is more complicated. The assumption of an elliptically symmetric distribution of
$X$ has to be strengthened, i.e. $X$ is assumed to be normally distributed.

Given this assumption, one can show that the vectors with the largest distance to
$Cov(Z \mid Y = y) - E\{Cov(Z \mid Y = y)\}$ for all $y$ are the most interesting for the
EDR-space. An appropriate measure for the overall mean distance is, according to
Li (1992),

$$E\left( \| [Cov(Z \mid Y = y) - E\{Cov(Z \mid Y = y)\}] b \|^2 \right)
  = b^{\top} E\left( \| Cov(Z \mid y) - E\{Cov(Z \mid y)\} \|^2 \right) b. \tag{20.12}$$

Equipped with this distance, we again conduct an eigensystem decomposition, this
time for the above expectation $E(\| Cov(Z \mid y) - E\{Cov(Z \mid y)\} \|^2)$. Then we take
the rescaled eigenvectors with the largest eigenvalues as estimates for the unknown
EDR-directions.


The  SIR  II  Algorithm 

The algorithm of SIR II is very similar to the one for SIR; it differs in only
two steps. Instead of merely computing the mean, the covariance of each slice
has to be computed. The estimate for the above expectation (20.12) is calculated
after computing all slice covariances. Finally, decomposition and rescaling are
conducted, as before.

1. Do steps 1-3 of the SIR algorithm.

2. Compute the slice covariance matrix $\hat{V}_s$:

$$\hat{V}_s = (n_s - 1)^{-1} \left\{ \sum_{i=1}^{n} I_{H_s}(y_i) z_i z_i^{\top} - n_s \bar{z}_s \bar{z}_s^{\top} \right\}.$$

3. Calculate the mean over all slice covariances:

$$\bar{V} = n^{-1} \sum_{s=1}^{S} n_s \hat{V}_s.$$

4. Compute an estimate for (20.12):

$$\hat{V} = n^{-1} \sum_{s=1}^{S} n_s (\hat{V}_s - \bar{V})^2 = n^{-1} \sum_{s=1}^{S} n_s \hat{V}_s^2 - \bar{V}^2.$$

5. Identify the eigenvectors and eigenvalues of $\hat{V}$ and scale back the eigenvectors.
This gives estimates for the SIR II EDR-directions:

$$\hat{\beta}_i = \hat{\Sigma}^{-1/2} \hat{\eta}_i.$$
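The SIR II steps can be sketched in the same way as SIR. The code below is an illustration only: equal-count slices, the function names `inv_sqrt` and `sir2`, and the simulated response (a pure quadratic in one index, for which the slice means carry no information) are assumptions, not part of the text.

```python
import numpy as np

def inv_sqrt(S):
    """Symmetric inverse square root of a positive definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def sir2(X, y, n_slices=10):
    n = X.shape[0]
    S_inv_half = inv_sqrt(np.cov(X, rowvar=False))
    Z = (X - X.mean(axis=0)) @ S_inv_half                      # SIR step 1
    slices = np.array_split(np.argsort(y), n_slices)           # SIR step 2
    weights = np.array([len(idx) for idx in slices]) / n       # n_s / n
    V_s = [np.cov(Z[idx], rowvar=False) for idx in slices]     # step 2: slice covariances
    V_bar = sum(w * V for w, V in zip(weights, V_s))           # step 3: their weighted mean
    V_hat = sum(w * (V - V_bar) @ (V - V_bar)                  # step 4: estimate of (20.12)
                for w, V in zip(weights, V_s))
    lam, eta = np.linalg.eigh(V_hat)                           # step 5: decomposition
    order = np.argsort(lam)[::-1]
    return lam[order], S_inv_half @ eta[:, order]              # rescaled directions

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 3))
b2 = np.array([1.0, -1.0, -1.0])
y = (X @ b2) ** 2 + 0.1 * rng.standard_normal(5000)

lam, beta = sir2(X, y)
cosine = abs(beta[:, 0] @ b2) / (np.linalg.norm(beta[:, 0]) * np.linalg.norm(b2))
print(cosine, lam / lam.sum())
```

`np.cov` divides by n_s - 1, which matches the slice covariance of step 2, and the square in step 4 is taken as a matrix product.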


Fig. 20.4 SIR: The left plots show the response versus the estimated EDR-directions. The upper
right plot is a three-dimensional plot of the first two directions and the response. The lower right
plot shows the eigenvalues $\hat{\lambda}_i$ (asterisk) and the cumulative sum (open circle) Q MVAsirdata


Example 20.3 The result of SIR is visualised in four plots in Fig. 20.4: the left two
show the response variable versus the first and the second direction, respectively. The
upper right plot consists of a three-dimensional plot of the first two directions and the
response. The last picture shows $\hat{\Psi}_k$, the ratio of the sum of the first $k$ eigenvalues
and the sum of all eigenvalues, similar to PCA.

The data are generated according to the following model:

$$y_i = \beta_1^{\top} x_i + (\beta_1^{\top} x_i)^3 + 4(\beta_2^{\top} x_i)^2 + \varepsilon_i,$$

where the $x_i$'s follow a three-dimensional normal distribution with zero mean and
covariance equal to the identity matrix, $\beta_2 = (1, -1, -1)^{\top}$, and $\beta_1 = (1, 1, 1)^{\top}$.
$\varepsilon_i$ is standard normally distributed and $n = 300$. Corresponding to model (20.10),
$m(u, v, \varepsilon) = u + u^3 + v^2 + \varepsilon$. The situation is depicted in Figs. 20.5 and 20.6.

Fig. 20.5 Plot of the true response versus the true first index. The monotonic and the
convex shapes can be clearly seen Q MVAsirdata

Fig. 20.6 Plot of the true response versus the true second index. The monotonic
and the convex shapes can be clearly seen Q MVAsirdata

Both algorithms were conducted using the slicing method with 20 elements in
each slice. The goal was to find $\beta_1$ and $\beta_2$ with SIR. The data are designed such
that SIR can detect $\beta_1$ because of the monotonic shape of $\{\beta_1^{\top} x_i + (\beta_1^{\top} x_i)^3\}$, while
SIR II will search for $\beta_2$, as in this direction the conditional variance on $y$ is varying.

Table 20.3 SIR: EDR-directions for simulated data

  $\hat{\beta}_1$    $\hat{\beta}_2$    $\hat{\beta}_3$
   0.452     0.881     0.040
   0.571    -0.349    -0.787
   0.684    -0.320     0.615

Table 20.4 SIR II: EDR-directions for simulated data

  $\hat{\beta}_1$    $\hat{\beta}_2$    $\hat{\beta}_3$
  -0.272     0.964    -0.001
   0.670     0.100     0.777
   0.690     0.244    -0.630

If we normalise the eigenvalues for the EDR-directions in Table 20.3 such that
they sum up to one, the resulting vector is (0.852, 0.086, 0.062). As can be seen in
the upper left plot of Fig. 20.4, there is a functional relationship found between the
first index $\hat{\beta}_1^{\top} x$ and the response. Actually, $\beta_1$ and $\hat{\beta}_1$ are nearly parallel, that is, the
normalised inner product $\hat{\beta}_1^{\top} \beta_1 / \{\|\hat{\beta}_1\| \|\beta_1\|\} = 0.9894$ is very close to one.

The second direction along $\beta_2$ is probably found due to the good approximation,
but SIR does not provide it clearly, because it is "blind" with respect to the change
of variance, as the second eigenvalue indicates.
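The two summary quantities used here, the normalised inner product and the cumulative eigenvalue ratio, are simple to compute. The sketch below assumes that the first column of Table 20.3 is the estimated first direction, i.e. (0.452, 0.571, 0.684). Because the table entries are rounded, the value obtained from them (about 0.986) is close to, but not identical with, the reported 0.9894.

```python
import math

def normalised_inner_product(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

beta_1 = (1.0, 1.0, 1.0)               # true first direction of the simulated model
beta_1_hat = (0.452, 0.571, 0.684)     # assumed: first column of Table 20.3 (rounded)
cosine = normalised_inner_product(beta_1_hat, beta_1)

eigen_share = (0.852, 0.086, 0.062)    # normalised SIR eigenvalues from the text
psi = [sum(eigen_share[:k]) for k in range(1, len(eigen_share) + 1)]  # cumulative ratio

print(round(cosine, 4), [round(v, 3) for v in psi])
```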

For SIR II, the normalised eigenvalues are (0.706, 0.185, 0.108), that is, about
71 % of the variance is explained by the first EDR-direction (Table 20.4). Here, the
normalised inner product of $\beta_2$ and $\hat{\beta}_1$ is 0.9992. The estimator $\hat{\beta}_1$ estimates in fact
$\beta_2$ of the simulated model. In this case, SIR II found the direction where the second
moment varies with respect to $\beta_2^{\top} x$ (Fig. 20.7).

In summary, SIR has found the direction which shows a strong relation regarding
the conditional expectation between $\beta_1^{\top} x$ and $y$, and SIR II has found the direction
where the conditional variance is varying, namely, $\beta_2^{\top} x$.

The behaviour of the two SIR algorithms is as expected. In addition, we have
seen that it is worthwhile to apply both versions of SIR. It is possible to combine
SIR and SIR II (Cook & Weisberg, 1991; Li, 1991; Schott, 1994) directly, or to
investigate higher conditional moments. For the latter it seems to be difficult to
obtain theoretical results.




Summary

↪ SIR serves as a dimension reduction tool for regression problems.

↪ Inverse regression avoids the curse of dimensionality.

↪ The dimension reduction can be conducted without estimation of
  the regression function $y = m(x)$.

↪ SIR searches for the effective dimension reduction (EDR) by
  computing the inverse regression IR.

↪ SIR II uses the EDR on computing the inverse conditional variance.

↪ SIR might miss EDR directions that are found by SIR II.

Fig. 20.7 SIR II mainly sees the direction $\beta_2$. The left plots show the response versus the
estimated EDR-directions. The upper right plot is a three-dimensional plot of the first two
directions and the response. The lower right plot shows the eigenvalues $\hat{\lambda}_i$ (asterisk) and the
cumulative sum (open circle) Q MVAsir2data


20.4 Support Vector Machines

The purpose of this section is to introduce one of the most promising among recently
developed multivariate non-linear statistical techniques: the support vector machine
(SVM). The SVM is a classification method that is based on statistical learning theory.
It has been successfully applied to optical character recognition, early medical
diagnostics, and text classification. One application where SVMs outperformed other
methods is electric load prediction (EUNITE, 2001), another one is optical character
recognition (Vapnik, 1995). In a variety of applications SVMs produce better classification
results than parametric methods (e.g. logit analysis) and outperform widely
used nonparametric techniques, such as neural networks. Here we apply SVMs to
corporate bankruptcy analysis.


Classification  Methodology 

In order to illustrate the classification methodology we focus for the moment on
a company rating example that we will treat further in more detail. Investment
risks are evaluated via the default probability (PD) for a company. Each company is
described by a set of variables (predictors) $x$, such as financial ratios, and its class $y$
that can be either $y = -1$ ("successful") or $y = 1$ ("bankrupt"). Financial ratios are
constructed from variables like net income, total assets, interest payments, etc.
A training set represents a sample of data for companies which are known to have
survived or gone bankrupt. From the training set one estimates a classifier function
$f$ that is then applied to computing PDs. These PDs can be uniquely translated into
a company rating.

Classical discriminant analysis is based on the assumption that each group of
observations is normally distributed with the same variance-covariance matrix but
different means. Under such a formulation the discriminating function will be linear,
see Theorem 14.2. Figure 20.8 displays this situation: if some linear combination of
predictors (called Z-score in the context of bankruptcy analysis) is greater than
a particular threshold value $z_0$ the observation under consideration is regarded as
belonging to $y = 1$; if $Z < z_0$ the observation would belong to $y = -1$
(successful). One can change the labels "-1, +1" to the more standard notation
"0, 1". The current labelling is done only for mathematical convenience.

The Z-score is:

$$Z_i = a_1 x_{i1} + a_2 x_{i2} + \ldots + a_p x_{ip} = a^{\top} x_i,$$

where $x_i = (x_{i1}, \ldots, x_{ip})^{\top} \in \mathbb{R}^p$ are predictors for the $i$-th company. The
classification based on the Z-score is necessarily linear and, therefore, may not
handle more complex situations as in Fig. 20.9, where non-linear classifiers, such as
those generated by SVMs, can produce better results.

Fig. 20.8 A linear classification function in the case of linearly separable data

Fig. 20.9 Different linear classification functions (1) and (2) and a non-linear one (3)
in the linearly non-separable case


Expected  vs.  Empirical  Risk  Minimisation 


A non-linear classifier function $f$ may be described by a function class $\mathcal{F}$. $\mathcal{F}$ is fixed
a priori, e.g. it can be the class of linear classifiers (hyperplanes). A good classifier
optimises some criterion that tells us how well $f$ separates the classes. As in (14.4)
one considers the minimisation of the expected risk:

$$R(f) = \int \frac{1}{2} |f(x) - y| \, dF(x, y). \tag{20.13}$$

The joint distribution $F(x, y)$, however, is never known in practical applications and
must be estimated from the training set $\{x_i, y_i\}_{i=1}^{n}$. By replacing $F(x, y)$ with the
empirical cdf $F_n(x, y)$ one obtains the empirical risk:

$$\hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2} |f(x_i) - y_i|. \tag{20.14}$$

The empirical risk is an average value of loss over the training set, while the
expected risk is the expected value of loss under the true probability measure. The
loss is given by:

$$L(x, y) = \frac{1}{2} |f(x) - y| =
\begin{cases}
0, & \text{if classification is correct,} \\
1, & \text{if classification is wrong.}
\end{cases}$$

One sees here that it is convenient to work with the labels "-1, 1" for $y$. The
solutions to the problems of expected and empirical risk minimisation:

$$f_{opt} = \arg\min_{f \in \mathcal{F}} R(f), \tag{20.15}$$

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}(f), \tag{20.16}$$

generally do not coincide (Fig. 20.10), although they converge as $n \to \infty$ if $\mathcal{F}$ is not
too large. According to statistical learning theory (Vapnik, 1995), it is possible to
get a uniform upper bound on the difference between $R(f)$ and $\hat{R}(f)$ via the
Vapnik-Chervonenkis (VC) theory. The VC bound states that there is a function
$\phi$ (monotone increasing in $h$) so that for all $f \in \mathcal{F}$ with a probability $1 - \eta$:

$$R(f) \le \hat{R}(f) + \phi\left( \frac{h}{n}, \frac{\log(\eta)}{n} \right). \tag{20.17}$$

Fig. 20.10 The minima $f_{opt}$ and $\hat{f}_n$ of the expected ($R$) and empirical ($\hat{R}$) risk
functions generally do not coincide

Here $h$ denotes the VC dimension, a measure of complexity of the involved function
class $\mathcal{F}$. For a linear classification rule $g(x) = \mathrm{sign}(x^{\top} w + b)$:

$$\phi\left( \frac{h}{n}, \frac{\log(\eta)}{n} \right)
  = \sqrt{ \frac{ h \left( \log\frac{2n}{h} + 1 \right) - \log\frac{\eta}{4} }{ n } }, \tag{20.18}$$

where $h$ is the VC dimension. Writing $u = h/n$ and $v = \log(\eta)/n$, (20.18) becomes
$\phi(u, v) = \{ u (\log\frac{2}{u} + 1) - v + n^{-1} \log 4 \}^{1/2}$. By plotting this function
for small $u$ one sees the monotonicity of $\phi(u, v)$. In fact one can show that

$$\frac{\partial \phi(u, v)}{\partial u} > 0$$

if and only if $2n > h$. For a linear classifier with $h = p + 1$ this is an easy condition
to meet.
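The empirical risk (20.14) and the penalty term of the VC bound for a linear rule can be evaluated directly. The following sketch uses a hypothetical toy sample, a fixed linear rule with arbitrarily chosen weights, and illustrative values for n and the confidence parameter. None of these come from the text.

```python
import math

def empirical_risk(f, xs, ys):
    """Average 0-1 loss, written as (1/n) * sum of 0.5*|f(x_i) - y_i| with labels -1/+1."""
    return sum(0.5 * abs(f(x) - y) for x, y in zip(xs, ys)) / len(xs)

def vc_penalty(h, n, eta):
    """Penalty term of the VC bound for VC dimension h, sample size n, level eta."""
    return math.sqrt((h * (math.log(2.0 * n / h) + 1.0) - math.log(eta / 4.0)) / n)

def linear_rule(x, w=(1.0, -1.0), b=0.0):
    """g(x) = sign(x'w + b) with hypothetical weights."""
    return 1 if x[0] * w[0] + x[1] * w[1] + b >= 0 else -1

xs = [(2.0, 1.0), (0.0, 1.0), (3.0, 0.0), (1.0, 2.0)]
ys = [1, -1, 1, 1]                       # the last point is misclassified by linear_rule
risk = empirical_risk(linear_rule, xs, ys)

h = 3                                    # h = p + 1 for a linear classifier with p = 2
penalty = vc_penalty(h, n=1000, eta=0.05)
print(risk, round(penalty, 4))
```

Raising h while keeping n fixed increases the penalty as long as 2n > h, in line with the monotonicity statement above.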

The VC dimension of a set $\mathcal{F}$ of functions in a $d$-dimensional space is $h$ if some
function $f \in \mathcal{F}$ can shatter $h$ objects $\{x_i \in \mathbb{R}^d, i = 1, \ldots, h\}$ in all $2^h$ possible
configurations and no set $\{x_j \in \mathbb{R}^d, j = 1, \ldots, q\}$ with $q > h$ exists that satisfies
this property. For example, three points on a plane ($d = 2$) can be shattered by linear
indicator functions in $2^h = 2^3 = 8$ ways, whereas 4 points cannot be shattered in
$2^q = 2^4 = 16$ ways. Thus, the VC dimension of the set of linear indicator functions
in a two-dimensional space is $h = 3$, see Fig. 20.11. The expression for the VC
bound (20.17) involves the VC dimension $h$, a parameter controlling complexity
of $\mathcal{F}$. The term $\phi\left( \frac{h}{n}, \frac{\log(\eta)}{n} \right)$ introduces a penalty for excessive complexity of a
classifier function. The higher the complexity of $f \in \mathcal{F}$, the higher are $h$ and
therefore $\phi$. There is a trade-off between the number of classification errors on the
training set and the complexity of the classifier function. If the complexity were
not controlled for, it would be possible to construct a classifier function with no
classification errors on the training set notwithstanding how low its generalisation
ability would be.
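The shattering argument can be checked by brute force. The sketch below enumerates the labelings that linear indicator functions produce on a given planar point set by scanning a grid of directions and all thresholds between projected points. The grid resolution, the function name `linear_dichotomies` and the two point sets are assumptions chosen for illustration.

```python
import math

def linear_dichotomies(points, n_angles=720):
    """Collect all labelings sign(w'x - t) reachable on `points` over a grid of directions."""
    labelings = set()
    for a in range(n_angles):
        theta = 2.0 * math.pi * a / n_angles
        w = (math.cos(theta), math.sin(theta))
        proj = [w[0] * x + w[1] * y for x, y in points]
        cuts = sorted(set(round(p, 9) for p in proj))          # distinct projected values
        thresholds = [cuts[0] - 1.0, cuts[-1] + 1.0]           # all +1 and all -1
        thresholds += [(u + v) / 2.0 for u, v in zip(cuts, cuts[1:])]
        for t in thresholds:
            labelings.add(tuple(1 if p > t else -1 for p in proj))
    return labelings

triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

n_three = len(linear_dichotomies(triangle))   # all 2^3 labelings are reachable
n_four = len(linear_dichotomies(square))      # fewer than 2^4: the two XOR labelings fail
print(n_three, n_four)
```

For the square the two "diagonal" labelings are never produced, so four points are not shattered.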
Fig. 20.11 Eight possible ways of shattering 3 points on the plane with a linear indicator function

The SVM in the Linearly Separable Case

First we will describe the SVM in the linearly separable case. The family $\mathcal{F}$ of
classification functions in the data space is given by:

$$\mathcal{F} = \{ x^{\top} w + b, \; w \in \mathbb{R}^p, \; b \in \mathbb{R} \}. \tag{20.19}$$

In order to determine the support vectors we choose $f \in \mathcal{F}$ (or equivalently $(w, b)$)
such that the so-called margin, the corridor between the separating hyperplanes,
is maximal. This situation is illustrated in Fig. 20.12. The margin is equal to $d_- + d_+$.
The classification function is a hyperplane plus the margin zone, where, in the
separable case, no observations can lie. It separates the points from both classes with
the highest "safest" distance (margin) between them. It can be shown that margin
maximisation corresponds to the reduction of complexity as given by the VC-dimension
of the SVM classifier. Apparently, the separating hyperplane is defined
only by the support vectors that hold the hyperplanes parallel to the separating one.
In Fig. 20.12 there are three support vectors that are marked with bold style: two
crosses and one circle. We come now to the description of the SVM selection.

Let $x^{\top} w + b = 0$ be a separating hyperplane. Then $d_+$ ($d_-$) will be the shortest
distance to the closest objects from the classes $+1$ ($-1$). Since the separation can
be done without errors, all observations $i = 1, 2, \ldots, n$ must satisfy:

$$x_i^{\top} w + b \ge +1 \quad \text{for } y_i = +1,$$
$$x_i^{\top} w + b \le -1 \quad \text{for } y_i = -1.$$

Fig. 20.12 The separating hyperplane $x^{\top} w + b = 0$ and the margin in the linearly
separable case

We can combine both constraints into one:

$$y_i (x_i^{\top} w + b) - 1 \ge 0, \quad i = 1, 2, \ldots, n. \tag{20.20}$$

The canonical hyperplanes $x_i^{\top} w + b = \pm 1$ are parallel and the distance between
each of them and the separating hyperplane is $d_+ = d_- = 1/\|w\|$. To maximise the
margin $d_+ + d_- = 2/\|w\|$ one therefore minimises the Euclidean norm $\|w\|$ or its
square $\|w\|^2$.

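The Lagrangian for the primal problem that corresponds to margin maximisation
subject to constraint (20.20) is:

$$L_P(w, b) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{n} \alpha_i \{ y_i (x_i^{\top} w + b) - 1 \}. \tag{20.21}$$

The Karush-Kuhn-Tucker (KKT) (Gale et al., 1951) first order optimality
conditions are:

$$\frac{\partial L_P}{\partial w} = 0: \quad w - \sum_{i=1}^{n} \alpha_i y_i x_i = 0$$
$$\frac{\partial L_P}{\partial b} = 0: \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$
$$y_i (x_i^{\top} w + b) - 1 \ge 0, \quad i = 1, \ldots, n$$
$$\alpha_i \ge 0$$
$$\alpha_i \{ y_i (x_i^{\top} w + b) - 1 \} = 0$$

From these first order conditions, we can derive $w = \sum_{i=1}^{n} \alpha_i y_i x_i$ and therefore
the summands in (20.21) read:

$$\frac{1}{2} \|w\|^2 = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^{\top} x_j,$$

$$- \sum_{i=1}^{n} \alpha_i \{ y_i (x_i^{\top} w + b) - 1 \}
  = - \sum_{i=1}^{n} \alpha_i y_i x_i^{\top} \sum_{j=1}^{n} \alpha_j y_j x_j + \sum_{i=1}^{n} \alpha_i
  = - \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^{\top} x_j + \sum_{i=1}^{n} \alpha_i.$$

Substituting this into (20.21) we obtain the Lagrangian for the dual problem:

$$L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^{\top} x_j. \tag{20.22}$$

The primal and dual problems are:

$$\min_{w, b} L_P(w, b),$$

$$\max_{\alpha} L_D(\alpha) \quad \text{s.t.} \quad \alpha_i \ge 0, \quad \sum_{i=1}^{n} \alpha_i y_i = 0.$$

Since the optimisation problem is convex the dual and primal formulations give the
same solution.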

Those points $i$ for which the equation $y_i (x_i^{\top} w + b) = 1$ holds are called support
vectors. After "training the SVM", i.e. solving the dual problem above and deriving
Lagrange multipliers (they are equal to 0 for non-support vectors), one can classify
a company. One uses the classification rule:

$$g(x) = \mathrm{sign}(x^{\top} w + b), \tag{20.23}$$

where $w = \sum_{i=1}^{n} \alpha_i y_i x_i$ and $b = -\frac{1}{2} (x_{+1} + x_{-1})^{\top} w$. $x_{+1}$ and $x_{-1}$ are two support
vectors belonging to different classes for which $y (x^{\top} w + b) = 1$; adding the two
equalities $x_{+1}^{\top} w + b = 1$ and $x_{-1}^{\top} w + b = -1$ gives this expression for $b$. The value of the
classification function (the score of a company) can be computed as

$$f(x) = x^{\top} w + b. \tag{20.24}$$

Each score $f(x)$ uniquely corresponds to a default probability (PD). The higher
$f(x)$ the higher the PD.
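To see the dual problem, the support vectors and the classification rule working together, one can solve a tiny separable example. The sketch below is not the method used in the book: it maximises the dual by a simplified pairwise coordinate ascent (in the spirit of sequential minimal optimisation), which keeps the constraint on the weighted sum of multipliers satisfied at every update. The four data points, the function names and the number of sweeps are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_dual_svm(X, y, C=float("inf"), sweeps=50):
    """Maximise L_D(alpha) s.t. 0 <= alpha_i <= C and sum_i alpha_i y_i = 0."""
    n = len(X)
    K = [[dot(X[i], X[j]) for j in range(n)] for i in range(n)]
    alpha = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            for j in range(i + 1, n):
                # Errors E_k = sum_l alpha_l y_l K_lk - y_k (the offset b cancels in E_i - E_j).
                e_i = sum(alpha[l] * y[l] * K[l][i] for l in range(n)) - y[i]
                e_j = sum(alpha[l] * y[l] * K[l][j] for l in range(n)) - y[j]
                eta = K[i][i] + K[j][j] - 2.0 * K[i][j]
                if eta <= 1e-12:
                    continue
                # Feasible interval for alpha_j that keeps both multipliers in [0, C].
                if y[i] != y[j]:
                    low, high = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
                else:
                    low, high = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
                new_j = min(high, max(low, alpha[j] + y[j] * (e_i - e_j) / eta))
                alpha[i] += y[i] * y[j] * (alpha[j] - new_j)   # preserves sum alpha_i y_i
                alpha[j] = new_j
    return alpha

X = [(2.0, 1.0), (3.0, 2.0), (0.0, -1.0), (-1.0, -2.0)]
y = [1, 1, -1, -1]
alpha = train_dual_svm(X, y)

w = [sum(alpha[i] * y[i] * X[i][d] for i in range(len(X))) for d in range(2)]
sv_pos = next(X[i] for i in range(len(X)) if alpha[i] > 1e-8 and y[i] == 1)
sv_neg = next(X[i] for i in range(len(X)) if alpha[i] > 1e-8 and y[i] == -1)
b = -0.5 * dot([p + q for p, q in zip(sv_pos, sv_neg)], w)
scores = [dot(x, w) + b for x in X]                 # f(x) = x'w + b
dual_value = sum(alpha) - 0.5 * dot(w, w)           # L_D at the solution
primal_value = 0.5 * dot(w, w)                      # (1/2)||w||^2
print(alpha, w, b, scores)
```

Only the two points closest to the other class receive positive multipliers, their scores are exactly +1 and -1, and the dual and primal objective values coincide.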


SVMs  in  the  Linearly  Non-separable  Case 

In the linearly non-separable case the situation is as in Fig. 20.13. The slack
variables $\xi_i$ represent the violation of strict separation. In this case the following
inequalities can be read off from Fig. 20.13:

$$x_i^{\top} w + b \ge 1 - \xi_i \quad \text{for } y_i = 1,$$
$$x_i^{\top} w + b \le -1 + \xi_i \quad \text{for } y_i = -1,$$
$$\xi_i \ge 0.$$

Fig. 20.13 The separating hyperplane $x^{\top} w + b = 0$ and the margin in the linearly
non-separable case

They can be combined into two constraints:

$$y_i (x_i^{\top} w + b) \ge 1 - \xi_i, \tag{20.25}$$
$$\xi_i \ge 0. \tag{20.26}$$

SVM classification again maximises the margin given a family of classification
functions $\mathcal{F}$.

The penalty for misclassification, the classification error $\xi_i \ge 0$, is related to
the distance from a misclassified point $x_i$ to the canonical hyperplane bounding its
class. If $\xi_i > 0$, an error in separating the two sets occurs. The objective function
corresponding to penalised margin maximisation is then formulated as:

$$\frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i, \tag{20.27}$$

where the parameter $C$ characterises the weight given to the classification errors.
The minimisation of the objective function with constraints (20.25) and (20.26)
provides the highest possible margin in the case when classification errors are inevitable
due to the linearity of the separating hyperplane. Under such a formulation the
problem is convex.
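For a fixed hyperplane the smallest slack values compatible with (20.25) and (20.26) are the positive parts of 1 minus y times the score, so the penalised objective (20.27) is easy to evaluate. The sketch below uses a hypothetical hyperplane, four hypothetical points and C = 1. It evaluates the objective and does not optimise it.

```python
def slack_and_objective(X, y, w, b, C):
    """Smallest feasible slacks for a given (w, b) and the penalised objective value."""
    scores = [sum(wd * xd for wd, xd in zip(w, x)) + b for x in X]
    slack = [max(0.0, 1.0 - yi * s) for yi, s in zip(y, scores)]   # xi_i >= 0
    objective = 0.5 * sum(wd * wd for wd in w) + C * sum(slack)
    return slack, objective

X = [(3.0, 1.0), (1.5, 0.0), (0.0, 1.0), (2.0, 2.0)]
y = [1, 1, -1, -1]
slack, objective = slack_and_objective(X, y, w=(1.0, 0.0), b=-1.0, C=1.0)
print(slack, objective)
```

The second point lies inside the margin (slack 0.5), the fourth is on the wrong side of the hyperplane (slack 2), and the other two satisfy the separable-case constraint with zero slack.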

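The Lagrange function for the primal problem is:

$$L_P(w, b, \xi) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i
  - \sum_{i=1}^{n} \alpha_i \{ y_i (x_i^{\top} w + b) - 1 + \xi_i \} - \sum_{i=1}^{n} \mu_i \xi_i, \tag{20.28}$$

where $\alpha_i \ge 0$ and $\mu_i \ge 0$ are Lagrange multipliers. The primal problem is
formulated as:

$$\min_{w, b, \xi} L_P(w, b, \xi).$$

The first order conditions in this case are:

$$\frac{\partial L_P}{\partial w} = 0: \quad w - \sum_{i=1}^{n} \alpha_i y_i x_i = 0$$
$$\frac{\partial L_P}{\partial b} = 0: \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$
$$\frac{\partial L_P}{\partial \xi_i} = 0: \quad C - \alpha_i - \mu_i = 0$$

With the conditions for the Lagrange multipliers:

$$\alpha_i \ge 0$$
$$\mu_i \ge 0$$
$$\alpha_i \{ y_i (x_i^{\top} w + b) - 1 + \xi_i \} = 0$$
$$\mu_i \xi_i = 0$$

Note that $\sum_{i=1}^{n} \alpha_i y_i b = 0$; therefore, similar to the linearly separable case, the primal
problem translates into:

$$L_D(\alpha) = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^{\top} x_j
  - \sum_{i=1}^{n} \alpha_i y_i x_i^{\top} \sum_{j=1}^{n} \alpha_j y_j x_j
  + C \sum_{i=1}^{n} \xi_i + \sum_{i=1}^{n} \alpha_i - \sum_{i=1}^{n} \alpha_i \xi_i - \sum_{i=1}^{n} \mu_i \xi_i$$

$$= \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^{\top} x_j
  + \sum_{i=1}^{n} \xi_i (C - \alpha_i - \mu_i).$$

Since the last term is 0 we derive the dual problem as:

$$L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^{\top} x_j, \tag{20.29}$$

and the dual problem is posed as:

$$\max_{\alpha} L_D(\alpha),$$

subject to:

$$0 \le \alpha_i \le C,$$
$$\sum_{i=1}^{n} \alpha_i y_i = 0.$$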
Non-linear  Classification 

The SVMs can also be generalised to the non-linear case. In order to obtain non-linear classifiers as in Fig. 20.14 one maps the data with a non-linear structure via a function $\psi: \mathbb{R}^p \to \mathbb{H}$ into a very high-dimensional space $\mathbb{H}$ where the classification rule is (almost) linear. Note that all the training vectors $x_i$ appear in $L_D$ (20.29) only as scalar products of the form $x_i^{\top}x_j$. In the non-linear SVM situation this transforms to $\psi(x_i)^{\top}\psi(x_j)$.

Fig. 20.14 Mapping into a three-dimensional feature space from a two-dimensional data space $\mathbb{R}^2 \to \mathbb{R}^3$. The transformation $\psi(x_1, x_2) = (x_1^2, \sqrt{2}x_1x_2, x_2^2)^{\top}$ corresponds to the kernel function $K(x_i, x_j) = (x_i^{\top}x_j)^2$

The so-called kernel trick is to compute this scalar product via a kernel function. These kernel functions are actually related to those we presented in Sect. 1.3. If a kernel function $K$ exists such that $K(x_i, x_j) = \psi(x_i)^{\top}\psi(x_j)$, then it can be used without knowing the transformation $\psi$ explicitly. A necessary and sufficient condition for a symmetric function $K(x_i, x_j)$ to be a kernel is given by Mercer’s theorem (Mercer, 1909). It requires positive definiteness, i.e. for any data set $x_1, \ldots, x_n$ and any real numbers $\lambda_1, \ldots, \lambda_n$ the function $K$ must satisfy

$$
\sum_{i=1}^{n}\sum_{j=1}^{n}\lambda_i\lambda_j K(x_i, x_j) \ge 0. \tag{20.30}
$$
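The kernel trick for the mapping of Fig. 20.14 can be checked numerically. The following is a minimal sketch in plain Python; the helper names `psi`, `dot` and `kernel`, the random test points and the seed are illustrative assumptions, not part of the text. It compares $(x^{\top}z)^2$ with the explicit scalar product in the feature space and evaluates the quadratic form in (20.30) for one random choice of $\lambda_1, \ldots, \lambda_n$.

```python
import math
import random


def psi(x):
    """Explicit feature map R^2 -> R^3 from the caption of Fig. 20.14."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def dot(u, v):
    """Euclidean scalar product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def kernel(x, z):
    """Quadratic kernel K(x, z) = (x^T z)^2."""
    return dot(x, z) ** 2


random.seed(0)  # arbitrary seed, only for reproducibility
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(20)]

# Largest gap between the kernel value and the explicit scalar product.
max_gap = max(abs(kernel(x, z) - dot(psi(x), psi(z)))
              for x in points for z in points)

# Quadratic form of (20.30) for one random draw of lambda_1, ..., lambda_n.
lam = [random.gauss(0, 1) for _ in points]
quad_form = sum(lam[i] * lam[j] * kernel(points[i], points[j])
                for i in range(len(points)) for j in range(len(points)))
```

A single random draw of the $\lambda_i$ does not prove positive definiteness, of course. It only illustrates the condition: the quadratic form equals the squared norm of $\sum_i \lambda_i \psi(x_i)$ and therefore cannot be negative.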

Some  examples  of  kernel  functions  are: 

- $K(x_i, x_j) = e^{-\|x_i - x_j\|^2/(2\sigma^2)}$: the isotropic Gaussian kernel with constant $\sigma$

- $K(x_i, x_j) = e^{-(x_i - x_j)^{\top} r^{-2}\Sigma^{-1}(x_i - x_j)/2}$: the stationary Gaussian kernel with an anisotropic radial basis with constant $r$ and variance-covariance matrix $\Sigma$ from the training set

- $K(x_i, x_j) = (x_i^{\top}x_j + 1)^p$: the polynomial kernel of degree $p$

- $K(x_i, x_j) = \tanh(k\,x_i^{\top}x_j - \delta)$: the hyperbolic tangent kernel with constants $k$ and $\delta$.
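The four kernels listed above can be written down directly. This is a sketch in plain Python under our own naming; the function names and the argument `sigma_inv` (the inverse of the variance-covariance matrix, supplied by the caller as a list of rows) are assumptions made for the illustration.

```python
import math


def dot(u, v):
    """Euclidean scalar product."""
    return sum(a * b for a, b in zip(u, v))


def gaussian_isotropic(x, z, sigma):
    """exp(-||x - z||^2 / (2 sigma^2))."""
    dist2 = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-dist2 / (2.0 * sigma ** 2))


def gaussian_anisotropic(x, z, r, sigma_inv):
    """exp(-(x - z)^T r^-2 Sigma^-1 (x - z) / 2); sigma_inv is Sigma^-1."""
    d = [a - b for a, b in zip(x, z)]
    quad = sum(d[i] * sigma_inv[i][j] * d[j]
               for i in range(len(d)) for j in range(len(d)))
    return math.exp(-quad / (2.0 * r ** 2))


def polynomial(x, z, p):
    """(x^T z + 1)^p."""
    return (dot(x, z) + 1.0) ** p


def hyperbolic_tangent(x, z, k, delta):
    """tanh(k x^T z - delta)."""
    return math.tanh(k * dot(x, z) - delta)
```

With `sigma_inv` equal to the identity matrix and `r = sigma`, the anisotropic kernel reduces to the isotropic one, which gives a quick sanity check of both formulas.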


SVMs  for  Simulated  Data 

The basic parameters of SVMs are the scaling $r$ of the anisotropic radial basis functions (in the stationary Gaussian kernel) and the capacity $C$. The parameter $r$ controls the local resolution of the SVM in the sense that a smaller $r$ allows the margin to curve on a smaller scale. The capacity $C$ controls the amount of slack allowed for misclassified observations. A large $C$ would create a very rough and curved margin, whereas $C$ close to zero makes the margin smoother.

One of the guinea pig tests for a classification algorithm is the data described as “orange peel”, i.e. two groups of observations with similar means but different variances. The classification results in this case are presented in Fig. 20.15. An SVM with a radial basis kernel is highly suitable for this kind of data.

Another popular non-linear test is the classification of “spiral data”. We generated two spirals with the distance between them equal to 1.0 that span over $3\pi$ radians. The SVM was chosen with $r = 0.1$ and $C = 10/n$. The SVM was able to separate the classes without an error when noise with parameters $\varepsilon_i \sim N(0, 0.1^2 I)$ was injected into the pure spiral data (Fig. 20.16). Obviously, neither the “orange peel” nor the “spiral data” are linearly separable.


Fig. 20.15 SVM classification results for the “orange peel” data, $n = 200$, $d = 2$, $n_{-1} = n_{+1} = 100$, $x_{+1,i} \sim N((0,0)^{\top}, 2^2 I)$, $x_{-1,i} \sim N((0,0)^{\top}, 0.5^2 I)$ with SVM parameters $r = 0.5$ and $C = 20/200$. Q MVAsvmOrangePeel

Fig. 20.16 SVM classification results for the noisy spiral data. The spirals spread over $3\pi$ radians; the distance between the spirals equals 1.0. $d = 2$, $n_{-1} = n_{+1} = 100$, $n = 200$. The noise was injected with the parameters $\varepsilon_i \sim N(0, 0.1^2 I)$. The separation is perfect with SVM parameters $r = 0.1$ and $C = 10/200$. Q MVAsvmSpiral


Solution of the SVM Classification Problem

The standard SVM optimisation problem (20.29), which is a quadratic optimisation problem, is usually solved by means of quadratic programming (QP). This technique, however, is notorious for (i) its bad scaling properties (the time required to solve the problem is proportional to $n^3$, where $n$ is the number of observations), (ii) implementation difficulty and (iii) enormous memory requirements. With the QP technique the whole kernel matrix of size $n \times n$ has to fit in memory, which, assuming that each entry takes up 10 bytes of memory, requires $10 \times n \times n$ bytes. This means that 1 million observations (which is not unusual for practical applications such as credit scoring) will require 10 TBytes (terabytes) or 10,000,000 MBytes of operating memory. With a typical computer memory size of 512 MBytes no more than around 7,000 observations can be processed. Thus, the main emphasis in designing new algorithms was put on using special properties of SVMs to speed up the solution and reduce memory requirements.
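The memory arithmetic above is easy to reproduce. The sketch below uses the 10 bytes per kernel-matrix entry assumed in the text and decimal units (1 MByte = $10^6$ bytes, 1 TByte = $10^{12}$ bytes); the function names are our own.

```python
import math

BYTES_PER_ENTRY = 10  # assumption taken from the text


def kernel_matrix_bytes(n):
    """Memory needed to hold the full n x n kernel matrix."""
    return BYTES_PER_ENTRY * n * n


def max_observations(memory_bytes):
    """Largest n whose kernel matrix still fits into the given memory."""
    return math.isqrt(memory_bytes // BYTES_PER_ENTRY)


terabytes_for_one_million = kernel_matrix_bytes(1_000_000) / 10**12
n_max_512_mbytes = max_observations(512 * 10**6)
```

Under these assumptions one million observations need 10 TBytes, and 512 MBytes accommodate roughly 7,000 observations.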


Scoring  Companies 

For our illustration we selected the largest bankrupt companies with a capitalisation of no less than 1 billion USD. The dataset used in this work is from the Creditreform database provided by the Research Data Center (RDC) of the Humboldt-Universität zu Berlin. It contains financial information from about 20,000 solvent and 1,000 insolvent German companies. The period spans from 1996 to 2002 and in the case of the insolvent companies the information is gathered 2 years before the insolvency took place. The last annual report of a company before it goes bankrupt receives the indicator $y = 1$ and for the remaining (solvent) companies $y = -1$.

We are given 28 variables, e.g. cash, inventories, equity, EBIT, number of employees, and branch code. From the original data, we create common financial indicators which are denoted as $x_1, \ldots, x_{25}$. These ratios can be grouped into four categories: profitability, leverage, liquidity, and activity.

Obviously, data for the year 1996 are missing and we will exclude them from further calculations. In order to reduce the effect of outliers on the results, all observations that exceeded the upper limit of the IQ (inter-quartile range) or fell below its lower limit were replaced with these limits. To demonstrate how performance changes, we will use the Accounts Payable (AP) turnover (named X24) and the ratio of Operating Income (OI) to Total Assets (TA) (named X3). We randomly choose 50 solvent and 50 insolvent companies. The statistical description of the financial ratios is summarised in Table 20.5.

Keep in mind that different kernels will influence performance. We will use one of the most common ones, the isotropic Gaussian kernel. Triangles and circles in Fig. 20.17 represent successful and failing companies from the training set, respectively. The coloured background corresponds to different score values $f$. The bluer the area, the higher the score and the greater the probability of default. Most successful companies lying in the red area have positive profitability and a reasonable activity.

Figure 20.17 presents the classification results for an SVM using an isotropic Gaussian kernel with $\sigma = 100$ and the fixed capacity $C = 1$. With the given priors, the SVM has trouble distinguishing between solvent and insolvent companies. The radial base $\sigma$, which determines the minimum radius of a group, is too large. Notice that the SVM does a poor job of distinguishing between the groups even though most observations are used as support vectors.

The applied SVMs differed in two aspects: (i) their capacity, which is controlled by the coefficient $C$ in (20.28), and (ii) the complexity of the classifier functions, controlled in our case by the isotropic radial basis in the Gaussian kernel. In Fig. 20.18 the value of $\sigma$ is reduced to 2 while $C$ remains the same. The SVM starts recognising the difference between solvent and insolvent companies, resulting in sharper clusters. Figure 20.19 demonstrates the effect of changing the capacity on the classification result. The optimisation of the SVM parameters ($C$ and $\sigma$) can be done by a grid search or by a more advanced algorithm such as a genetic algorithm.

Fig. 20.17 Ratings of companies in two dimensions. Low complexity of classifier functions with $\sigma = 100$ and $C = 1$. Percentage of misclassification is 0.43

Table 20.5 Descriptive statistics for financial ratios

Fig. 20.18 Ratings of companies in two dimensions. The case of an average complexity of classifier functions with $\sigma = 2$ and capacity fixed at $C = 1$. Percentage of misclassification is reduced to 0.27. Q MVAsvmSig2C1

Fig. 20.19 Ratings of companies in two dimensions. High capacity ($C = 200$) with radial basis fixed at $\sigma = 0.5$. Percentage of misclassification is 0.10. Q MVAsvmSig05C200

Fig. 20.20 Cumulative accuracy profile (CAP) curve (axis label: Rank of Score)

Figure 20.20 shows a Cumulative Accuracy Profile (CAP) curve, which is particularly useful in that it simultaneously measures Type I and Type II errors. In statistical terms, the CAP curve represents the cumulative probability of default events for different percentiles of the risk score scale. We now introduce the Accuracy Ratio (AR), derived from the CAP curve, for measuring and comparing the performance of credit risk models. The AR is defined as the ratio of the area between the model CAP curve and the random CAP curve to the area between the perfect CAP curve and the random CAP curve (see Fig. 20.20). Perfect classification is attained if the value of AR is equal to one.
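The definition of the AR translates into a few lines of code. The following is a hedged sketch in plain Python, not the quantlet used for Fig. 20.20. It assumes that a higher score means a higher risk, that defaults are coded as 1 and non-defaults as 0, that there are no tied scores, and that areas are computed with the trapezoidal rule; all function names are our own.

```python
def cap_curve(scores, defaults):
    """CAP curve as two lists: share of companies, share of defaults found."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n, n_defaults = len(scores), sum(defaults)
    x, y, found = [0.0], [0.0], 0
    for rank, i in enumerate(order, start=1):
        found += defaults[i]
        x.append(rank / n)
        y.append(found / n_defaults)
    return x, y


def area_under(x, y):
    """Trapezoidal area under a piecewise linear curve."""
    return sum((x[k] - x[k - 1]) * (y[k] + y[k - 1]) / 2.0
               for k in range(1, len(x)))


def accuracy_ratio(scores, defaults):
    """Area between model and random CAP over area between perfect and random CAP."""
    x, y = cap_curve(scores, defaults)
    default_rate = sum(defaults) / len(defaults)
    area_model = area_under(x, y)
    area_random = 0.5                         # diagonal
    area_perfect = 1.0 - default_rate / 2.0   # all defaults ranked first
    return (area_model - area_random) / (area_perfect - area_random)


ar_perfect = accuracy_ratio([0.9, 0.8, 0.3, 0.2, 0.1], [1, 1, 0, 0, 0])
ar_reversed = accuracy_ratio([0.1, 0.2, 0.3, 0.8, 0.9], [1, 1, 0, 0, 0])
```

A score that ranks all defaulting companies first gives an AR of one, and a score that ranks them last gives minus one. A random score is expected to give a value near zero.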


Summary

- SVM classification is done by mapping the data into feature space and finding a separating hyperplane there

- The support vectors are determined via a quadratic optimisation problem

- SVM produces highly non-linear classification boundaries


20.5  Classification  and  Regression  Trees 

Classification and Regression Trees (CART) is a method of data analysis developed by a group of American statisticians (Breiman et al., 1984). The aim of CART is to classify observations into a subset of known classes or to predict levels of regression functions. CART is a non-parametric tool designed to represent decision rules in the form of so-called binary trees. Binary trees split a learning sample parallel to the coordinate axes and represent the resulting data clusters hierarchically, starting from a root node for the whole learning sample and ending with relatively homogeneous buckets of observations.

Regression trees are constructed in a similar way, but the final buckets do not represent classes but rather approximations to an unknown regression function at a particular point of the independent variable. In this sense regression trees are estimates via a non-parametric regression model. Here we provide an overview of how decision trees are created and what challenges arise during practical applications; a number of examples illustrate the power of CART.


How  Does  CART  Work? 

Consider the example of how high risk patients (those who will not survive at least 30 days after being admitted with a heart attack) were identified at the San Diego Medical Center, University of California, on the basis of the initial 24-h data. A classification rule using at most three decisions (questions) is presented in Fig. 20.21. Left branches of the tree represent positive answers and right branches negative ones, so that e.g. if the minimum systolic blood pressure over the last 24 h is less than or equal to 91, then the patient belongs to the high risk group. In this example the dependent variable is binary: low risk (0) and high risk (1).

Fig. 20.21 Decision tree for low/high risk patients

A different situation occurs when we are interested in the expected number of days the patient will be able to survive. The decision tree will probably change and the terminal nodes will now indicate the mean expected number of days the patient will survive. This situation describes a regression tree rather than a classification tree.

In a more formal setup let $Y$ be a dependent variable, binary or continuous, and $X \in \mathbb{R}^d$. We are interested in approximating

$$
f(x) = E(Y \mid X = x)
$$

For the definition of conditional expectations we refer to Sect. 4.2. CART estimates this function $f$ by a step function that is constructed via splits along the coordinate axes. An illustration is given in Fig. 20.22. The regression function $f(x)$ is approximated by the values of the step function. The splits along the coordinate axes are to be determined from the data.

The following simple one-dimensional example shows that the choice of split points involves some decisions. Suppose that $f(x) = I(x \in [0,1]) + 2I(x \in [1,2])$ is a simple step function with a step at $x = 1$. Assume now that one observes $Y_i = f(X_i) + \varepsilon_i$, $X_i \sim U[0,2]$, $\varepsilon_i \sim N(0,1)$. By going through the $X$ data points as possible split points one sees that in the neighbourhood of $x = 1$ one has two possibilities: one takes either the observation $X_i$ to the left of 1 or the observation to the right of 1. In order to make such splits unique one averages these neighbouring points.
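Averaging neighbouring observations to obtain unique split points can be sketched as follows. The function name `candidate_splits` and the small data set are illustrative assumptions.

```python
def candidate_splits(xs):
    """Midpoints between neighbouring distinct observed values of X."""
    values = sorted(set(xs))
    return [(left + right) / 2.0 for left, right in zip(values, values[1:])]


# Four observations; 0.75 and 1.25 are the neighbours of the true step at x = 1.
splits = candidate_splits([0.25, 1.5, 0.75, 1.25])
```

Here the two observations closest to the step, 0.75 and 1.25, lead to the single candidate split 1.0 instead of two competing ones.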

Fig.  20.22  CART  orthogonal  splitting  example  where  each  colour  corresponds  to  one  cluster 


Impurity  Measures 

A more formal framework on how to split and where to split needs to be developed. Suppose there are $n$ observations in the learning sample and $n_j$ is the overall number of observations belonging to class $j$, $j = 1, \ldots, J$. The class probabilities are:

$$
\pi(j) = \frac{n_j}{n} \tag{20.31}
$$

$\pi(j)$ is the proportion of observations belonging to a particular class. Let $n(t)$ be the number of observations at node $t$ and $n_j(t)$ the number of observations belonging to the $j$-th class at $t$. The frequency of the event that an observation of the $j$-th class falls into node $t$ is:

$$
p(j,t) = \pi(j)\,\frac{n_j(t)}{n_j} \tag{20.32}
$$

The proportion of observations at $t$ is $p(t) = \sum_{j=1}^{J} p(j,t)$, and the conditional probability that an observation belongs to class $j$ given that it is at node $t$ is:

$$
p(j \mid t) = \frac{p(j,t)}{p(t)} = \frac{n_j(t)}{n(t)} \tag{20.33}
$$
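The quantities in (20.31) to (20.33) follow directly from class counts. The sketch below uses exact fractions; the function name and the example counts are assumptions made for the illustration.

```python
from fractions import Fraction


def node_probabilities(class_counts, node_counts):
    """class_counts[j] is n_j in the learning sample, node_counts[j] is n_j(t)."""
    n = sum(class_counts)
    prior = [Fraction(n_j, n) for n_j in class_counts]             # (20.31)
    joint = [prior[j] * Fraction(node_counts[j], class_counts[j])  # (20.32)
             for j in range(len(class_counts))]
    p_t = sum(joint)                                               # p(t)
    conditional = [p_jt / p_t for p_jt in joint]                   # (20.33)
    return prior, joint, p_t, conditional


# Hypothetical learning sample: 60 and 40 observations in two classes,
# of which 15 and 5 fall into node t.
prior, joint, p_t, conditional = node_probabilities([60, 40], [15, 5])
```

In this example $p(t) = 20/100$ and $p(1 \mid t) = 15/20$, which agrees with the right-hand side $n_j(t)/n(t)$ of (20.33).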
Define now a degree of class homogeneity in a given node. This characteristic, an impurity measure $i(t)$, will represent a class homogeneity indicator for a given tree node and hence will help to find optimal splits. Define an impurity function $\iota$ which is determined on $(p_1, \ldots, p_J) \in [0,1]^J$ with $\sum_{j=1}^{J} p_j = 1$ so that:

1. $\iota$ has a unique maximum at the point $(\frac{1}{J}, \frac{1}{J}, \ldots, \frac{1}{J})$;

2. $\iota$ attains its minimum only at the points $(1,0,0,\ldots,0)$, $(0,1,0,\ldots,0)$, ..., $(0,0,0,\ldots,1)$;

3. $\iota$ is a symmetric function of $p_1, \ldots, p_J$.

Each function satisfying these conditions is called an impurity function. Given $\iota$, define the impurity measure $i(t)$ for a node $t$ as:

$$
i(t) = \iota\{p(1 \mid t), p(2 \mid t), \ldots, p(J \mid t)\} \tag{20.34}
$$

Denote an arbitrary data split by $s$. Then for a given node $t$, which we will call a parent node, two child nodes described in Fig. 20.23 arise: $t_L$ and $t_R$, representing observations meeting and not meeting the split criterion $s$. A fraction $p_L$ of data from $t$ falls to the left child node and $p_R = 1 - p_L$ is the share of data in $t_R$.

A quality measure of how well split $s$ works is:

$$
\Delta i(s,t) = i(t) - p_L\,i(t_L) - p_R\,i(t_R) \tag{20.35}
$$

The higher the value of $\Delta i(s,t)$ the better the split, since data impurity is reduced. In order to find an optimal split $s^*$ it is natural to maximise $\Delta i(s,t)$. Note that in (20.35) the value of $i(t)$ remains constant for different splits $s$, hence it is equivalent to find

$$
\begin{aligned}
s^* &= \operatorname*{argmax}_{s}\ \Delta i(s,t) \\
&= \operatorname*{argmax}_{s}\ \{-p_L\,i(t_L) - p_R\,i(t_R)\} \\
&= \operatorname*{argmin}_{s}\ \{p_L\,i(t_L) + p_R\,i(t_R)\}
\end{aligned}
$$

where $t_L$ and $t_R$ are implicit functions of $s$. This splitting procedure is repeated until one arrives at a minimal bucket size. Classes are then assigned to terminal nodes using the following rule:

$$
\text{If } p(j \mid t) = \max_{i} p(i \mid t), \text{ then } j^*(t) = j \tag{20.36}
$$

If the maximum is not unique, then $j^*(t)$ is assigned randomly to one of the classes for which $p(i \mid t)$ takes its maximum value. The crucial question is of course how to define an impurity function $i(t)$. A natural definition of impurity is via a variance measure: assign 1 to all observations at node $t$ belonging to class $j$ and 0 to the others. A sample variance estimate for node $t$ observations is $p(j \mid t)\{1 - p(j \mid t)\}$. Summing over all $J$ classes we obtain the Gini index:

$$
i(t) = \sum_{j=1}^{J} p(j \mid t)\{1 - p(j \mid t)\} = 1 - \sum_{j=1}^{J} p^2(j \mid t) \tag{20.37}
$$

Fig. 20.23 Parent and child nodes hierarchy

The Gini index is an impurity function $\iota(p_1, \ldots, p_J)$, $p_j = p(j \mid t)$. It is not hard to see that the Gini index is a concave function. Since $p_L + p_R = 1$, we get:

$$
\begin{aligned}
i(t_L)p_L + i(t_R)p_R &= \iota\{p(1 \mid t_L), \ldots, p(J \mid t_L)\}p_L + \iota\{p(1 \mid t_R), \ldots, p(J \mid t_R)\}p_R \\
&\le \iota\{p_L\,p(1 \mid t_L) + p_R\,p(1 \mid t_R), \ldots, p_L\,p(J \mid t_L) + p_R\,p(J \mid t_R)\}
\end{aligned}
$$

where the inequality becomes an equality in case $p(j \mid t_L) = p(j \mid t_R)$, $j = 1, \ldots, J$. Recall that

$$
\frac{p(j,t_L)}{p(t)} = \frac{p(t_L)}{p(t)} \cdot \frac{p(j,t_L)}{p(t_L)} = p_L\,p(j \mid t_L)
$$

and since

$$
p(j \mid t) = \frac{p(j,t_L) + p(j,t_R)}{p(t)} = p_L\,p(j \mid t_L) + p_R\,p(j \mid t_R)
$$

we can conclude that

$$
i(t_L)p_L + i(t_R)p_R \le i(t) \tag{20.38}
$$

Hence each variant of data split leads to $\Delta i(s,t) > 0$ unless $p(j \mid t_L) = p(j \mid t_R) = p(j \mid t)$, i.e. when the split does not decrease class heterogeneity.
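The search for the best split under the Gini index can be sketched for a single explanatory variable. This is a minimal illustration in plain Python, not the CART implementation itself; the function names, the use of midpoints as candidate splits and the toy data are our own assumptions.

```python
from collections import Counter


def gini(labels):
    """Gini index (20.37) of the class labels observed at a node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(xs, labels, split):
    """Delta i(s, t) of (20.35) for the question 'x <= split'."""
    left = [y for x, y in zip(xs, labels) if x <= split]
    right = [y for x, y in zip(xs, labels) if x > split]
    p_left = len(left) / len(labels)
    p_right = 1.0 - p_left
    return gini(labels) - p_left * gini(left) - p_right * gini(right)


def best_split(xs, labels):
    """Return (s*, Delta i(s*, t)) over all midpoints of neighbouring values."""
    values = sorted(set(xs))
    candidates = [(a + b) / 2.0 for a, b in zip(values, values[1:])]
    return max(((s, impurity_decrease(xs, labels, s)) for s in candidates),
               key=lambda pair: pair[1])


xs = [1, 2, 3, 4, 5, 6]
labels = ["a", "a", "a", "b", "b", "b"]
split, gain = best_split(xs, labels)
all_gains = [impurity_decrease(xs, labels, s) for s in (1.5, 2.5, 3.5, 4.5, 5.5)]
```

For this toy sample the split at 3.5 separates the classes perfectly, so the impurity drops from 0.5 to 0. In line with (20.38), none of the candidate splits has a negative impurity decrease.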

Impurity measures can be defined in a number of different ways; for practical applications the so-called twoing rule can also be considered. Instead of maximising the impurity change at a particular node, the twoing rule tries to balance as if the learning sample had only two classes. The reason for such an algorithm is that such a decision rule is able to distinguish observations between general factors on top levels of the tree and take into account specific data characteristics at lower levels.

If $S = \{1, \ldots, J\}$ is the set of learning sample classes, divide it into two subsets

$$
S_1 = \{j_1, \ldots, j_n\}, \quad \text{and} \quad S_2 = S \setminus S_1
$$

All observations belonging to $S_1$ get dummy class 1, and the rest dummy class 2. The next step is to calculate $\Delta i(s,t)$ for different $s$ as if there were only two (dummy) classes. Since $\Delta i(s,t)$ actually depends on $S_1$, the value $\Delta i(s,t,S_1)$ is maximised. Now apply a two-step procedure: first, find $s^*(S_1)$ maximising $\Delta i(s,t,S_1)$ and second, find a superclass $S_1^*$ maximising $\Delta i\{s^*(S_1), t, S_1\}$. In other words the idea of twoing is to find a combination of superclasses at each node that maximises the impurity increment for two classes.

This method provides one big advantage: it finds the so-called strategic nodes, i.e. nodes filtering observations in such a way that they differ to the maximum feasible extent. Although applying the twoing rule may seem desirable, especially for data with a large number of classes, another challenge arises: computational speed. Let us assume that the learning sample has $J$ classes; then the set $S$ can be split into $S_1$ and $S_2$ in $2^{J-1}$ ways. For data with 11 classes this will create more than 1,000 combinations. Fortunately the following result helps to reduce drastically the amount of computations.

It can be proven (Breiman et al., 1984) that in a classification task with two classes and impurity measure $p(1 \mid t)p(2 \mid t)$, for an arbitrary split $s$ a superclass $S_1(s)$ is determined by:

$$
S_1(s) = \{j : p(j \mid t_L) \ge p(j \mid t_R)\}, \qquad
\max_{S_1} \Delta i(s,t,S_1) = \frac{p_L\,p_R}{4}\left[\sum_{j=1}^{J} \left|p(j \mid t_L) - p(j \mid t_R)\right|\right]^2 \tag{20.39}
$$

Hence the twoing rule can be applied in practice as well as the Gini index, although the former criterion works a bit slower.
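The closed form (20.39) can be compared with a brute-force search over all superclasses. The sketch below is an illustration under our own naming; the class probabilities in the two child nodes and the value of $p_L$ are made-up numbers.

```python
from itertools import combinations


def two_class_decrease(p_left, cond_left, cond_right, superclass):
    """Delta i(s, t, S_1) for the impurity p(1|t) p(2|t) and dummy class S_1."""
    p_right = 1.0 - p_left
    a = sum(cond_left[j] for j in superclass)    # p(S_1 | t_L)
    b = sum(cond_right[j] for j in superclass)   # p(S_1 | t_R)
    p = p_left * a + p_right * b                 # p(S_1 | t)
    return p * (1 - p) - p_left * a * (1 - a) - p_right * b * (1 - b)


def twoing_value(p_left, cond_left, cond_right):
    """Closed form (20.39)."""
    total = sum(abs(l - r) for l, r in zip(cond_left, cond_right))
    return p_left * (1.0 - p_left) / 4.0 * total ** 2


def brute_force_max(p_left, cond_left, cond_right):
    """Maximum of Delta i(s, t, S_1) over all non-trivial superclasses S_1."""
    classes = range(len(cond_left))
    return max(two_class_decrease(p_left, cond_left, cond_right, subset)
               for size in range(1, len(cond_left))
               for subset in combinations(classes, size))


# Hypothetical class distributions p(j | t_L) and p(j | t_R) for J = 4 classes.
cond_left = [0.5, 0.3, 0.1, 0.1]
cond_right = [0.1, 0.2, 0.3, 0.4]
closed_form = twoing_value(0.4, cond_left, cond_right)
exhaustive = brute_force_max(0.4, cond_left, cond_right)
```

Both routes give the same value, but the closed form needs a single pass over the $J$ classes, while the exhaustive search grows exponentially in $J$.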


Gini  Index  and  Twoing  Rule  in  Practice 

In this section we look at practical issues of using these two rules. Consider a learning dataset from Salford Systems with 400 observations characterising automobiles: their make, type, colour, technical parameters, age etc. The aim is to build a decision tree splitting different cars by their characteristics based on feasible relevant parameters. The classification tree constructed using the Gini index is given in Fig. 20.24.

Fig. 20.24 Classification tree constructed by Gini index


Fig.  20.25  Classification  tree  constructed  by  twoing  index 


A  particular  feature  here  is  that  at  each  node  observations  belonging  to  one  make 
are  filtered  out,  i.e.  observations  with  most  striking  characteristics  are  separated.  As 
a  result  a  decision  tree  is  able  to  pick  out  automobile  makes  quite  easily. 

The tree based on the twoing rule (Fig. 20.25) for the same data is different. Instead of specifying particular car makes at each node, application of the twoing rule results in strategic nodes, i.e. questions which distinguish between different car classes to the maximum extent. This feature can be vital when high-dimensional datasets with a large number of classes are processed.


Optimal Size of a Decision Tree

Up to now we were interested in determining the best split $s^*$ at a particular node. The next and perhaps more important question is how to determine the optimal tree size, i.e. when to stop splitting. If each terminal node contains only observations of a single class, then every point of the learning sample can be flawlessly classified using this maximum tree. But can such an approach be fruitful?

The maximum tree is a case of overspecification. Some criterion is required to stop data splitting. Since tree building is dependent on $\Delta i(s,t)$, one criterion is to stop data splitting if

$$
\Delta i(s,t) < \beta \tag{20.40}
$$

where $\beta$ is some threshold value.

The value of $\beta$ has to be chosen subjectively, which is unfortunately a drawback. Empirical simulations show that the impurity increment is frequently non-monotone, which is why even for small $\beta$ the tree may be underparametrised. Setting even smaller values for $\beta$ will probably remedy the situation but at the cost of tree overparametrisation.

Another way to determine the adequate shape of a decision tree is to demand a minimum number of observations $\bar{N}$ (bucket size) at each terminal node. A disadvantage is that if at terminal node $t$ the number of observations is higher,

$$
N(t) > \bar{N} \tag{20.41}
$$

then this node is also split, as the data are still not supposed to be clustered well enough.


Cross-Validation  for  Tree  Pruning 

Cross-validation is a procedure which uses the bigger part of the data as a training set and the rest as a test set. Then the process is looped so that different parts of the data become training and test sets, so that in the end each data point has been employed both as a member of a test set and of a training set. The aim of this procedure is to extract maximum information from the learning sample, especially in situations of data scarcity.

The procedure is implemented in the following way. First, the learning sample is randomly divided into $V$ parts. Using the training set formed by the union of $(V - 1)$ subsets a decision tree is constructed, while the remaining subset is used as a test set to verify the tree quality. This procedure is looped over all $V$ subsets.

Unfortunately, for small values of $V$ cross-validation estimates can be unstable: in each iteration a cluster of data is selected randomly and the number of iterations itself is relatively small, thus the overall estimation result is somewhat random. Nowadays cross-validation with $V = 10$ is an industry standard and for many applications a good balance between computational complexity and statistical precision.
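The partition of the learning sample into $V$ parts can be sketched as follows. The function names, the fixed seed and the sample size of 25 are illustrative assumptions; the tree construction itself is left out.

```python
import random


def v_fold_indices(n, v, seed=0):
    """Randomly partition the indices 0, ..., n-1 into v folds of similar size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[k::v] for k in range(v)]


def train_test_splits(n, v, seed=0):
    """Yield (training indices, test indices), one pair per fold."""
    folds = v_fold_indices(n, v, seed)
    for k, test in enumerate(folds):
        train = [i for m, fold in enumerate(folds) if m != k for i in fold]
        yield train, test


splits = list(train_test_splits(25, 10))
```

Each observation appears in exactly one test set and in the training sets of the other $V - 1$ iterations, which is the property described above.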


Cost-Complexity  Function  and  Cross-Validation 


Another method takes tree complexity into account, i.e. the number of terminal nodes. The maximum tree will get a penalty for its big size; on the other hand it will be able to make perfect in-sample predictions. Small trees will, of course, get a lower penalty for their size but their prediction abilities are limited. An optimisation procedure based on such a trade-off criterion could determine a good decision tree.

Define the internal misclassification error of an arbitrary observation at node $t$ as $e(t) = 1 - \max_j p(j \mid t)$, and define also $E(t) = e(t)p(t)$. Then the internal misclassification tree error is $E(T) = \sum_{t \in \tilde{T}} E(t)$ where $\tilde{T}$ is the set of terminal nodes. The estimates are called internal because they are based solely on the learning sample. It may seem that $E(T)$ is sufficient as a tree quality measure but unfortunately it is not so. Consider the case of the maximum tree: here $E(T_{MAX}) = 0$, i.e. by this measure the tree appears to be of the best configuration.


For any subtree $T$ ($\le T_{MAX}$) define the number of terminal nodes $|\tilde{T}|$ as a measure of its complexity. The following cost-complexity function can be used:

$$
E_\alpha(T) = E(T) + \alpha|\tilde{T}| \tag{20.42}
$$

where $\alpha \ge 0$ is a complexity parameter and $\alpha|\tilde{T}|$ is a cost component. The more complex the tree (high number of terminal nodes) the lower is $E(T)$ but at the same time the higher is the penalty $\alpha|\tilde{T}|$ and vice versa.
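The trade-off in (20.42) can be made concrete with a small numerical sketch. The three subtrees and their error values below are made up for the illustration, and the function names are our own.

```python
def cost_complexity(error, n_terminal, alpha):
    """E_alpha(T) = E(T) + alpha * (number of terminal nodes), see (20.42)."""
    return error + alpha * n_terminal


# Hypothetical subtrees: name -> (internal error E(T), number of terminal nodes).
SUBTREES = {"maximum": (0.00, 20), "medium": (0.08, 6), "root": (0.30, 1)}


def best_subtree(alpha):
    """Name of the subtree with the smallest cost-complexity for a given alpha."""
    return min(SUBTREES, key=lambda name: cost_complexity(*SUBTREES[name], alpha))
```

For $\alpha = 0$ the maximum tree wins because only $E(T)$ counts. For $\alpha = 0.01$ the medium tree has the smallest cost, and for $\alpha = 0.1$ the penalty dominates and the root alone is preferred.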

The number of subtrees of $T_{MAX}$ is finite. Hence pruning of $T_{MAX}$ leads to the creation of a subtree sequence $T_1, T_2, T_3, \ldots$ with a decreasing number of terminal nodes.

An important question is whether a subtree $T \le T_{MAX}$ minimising $E_\alpha(T)$ for a given $\alpha$ always exists and whether it is unique.

In Breiman et al. (1984) it is shown that for every $\alpha \ge 0$ there exists an optimal tree $T(\alpha)$ in the sense that

1. $E_\alpha\{T(\alpha)\} = \min_{T \le T_{MAX}} E_\alpha(T) = \min_{T \le T_{MAX}} \{E(T) + \alpha|\tilde{T}|\}$

2. if $E_\alpha(T) = E_\alpha\{T(\alpha)\}$ then $T(\alpha) \le T$.

This result is a proof of existence, but also a proof of uniqueness: consider another subtree $T'$ so that $T$ and $T'$ both minimise $E_\alpha$ and are not nested, then $T(\alpha)$ does not exist in accordance with the second condition.

The idea of introducing the cost-complexity function at this stage is to check only a subset of the different subtrees of $T_{MAX}$: the optimal subtrees for different values of $\alpha$. The starting point is to define the first optimal subtree in the sequence so that $E(T_1) = E(T_{MAX})$ and the size of $T_1$ is minimum among the subtrees with the same cost level. To get $T_1$ out of $T_{MAX}$, it is necessary to verify for each node $t$ of $T_{MAX}$ with terminal child nodes $t_L$ and $t_R$ the condition $E(t) = E(t_L) + E(t_R)$, and if it is fulfilled the children of node $t$ are pruned. The process is looped until no extra pruning is available; the resulting tree $T(0)$ becomes $T_1$.

Define a node $t$ as an ancestor of $t'$ and $t'$ as a descendant of $t$ if there is a connected path down the tree leading from $t$ to $t'$. Consider Fig. 20.26 where nodes $t_4, t_5, t_8, t_9, t_{10}$ and $t_{11}$ are descendants of $t_2$ while nodes $t_6$ and $t_7$ are not descendants of $t_2$ although they are positioned lower, since it is not possible to connect them with a path from $t_2$ to these nodes without engaging $t_1$. Nodes $t_4$, $t_2$ and $t_1$ are ancestors of $t_9$, and $t_3$ is not an ancestor of $t_9$.

Define the branch $T_t$ of the tree $T$ as a subtree based on node $t$ and all its descendants. An example is given in Fig. 20.27. Pruning a branch $T_t$ from a tree $T$ means deleting all descendant nodes of $t$. Denote the transformed tree as $T - T_t$. Pruning the branch $T_{t_2}$ results in the tree described in Fig. 20.28.

For any branch $T_t$ define the internal misclassification estimate as:

$$
E(T_t) = \sum_{t' \in \tilde{T}_t} E(t') \tag{20.43}
$$

where $\tilde{T}_t$ is the set of terminal nodes of $T_t$. Hence for an arbitrary node $t$ of $T$:

$$
E(t) \ge E(T_t) \tag{20.44}
$$

Fig. 20.26 Decision tree hierarchy

Fig. 20.27 The branch $T_{t_2}$ of the original tree $T$

Fig. 20.28 $T - T_{t_2}$, the pruned tree $T$


Consider now the cost-complexity misclassification estimate for branches or single nodes. Define for a single node $\{t\}$:

$$
E_\alpha(\{t\}) = E(t) + \alpha \tag{20.45}
$$

and for a branch:

$$
E_\alpha(T_t) = E(T_t) + \alpha|\tilde{T}_t| \tag{20.46}
$$

When $E_\alpha(T_t) < E_\alpha(\{t\})$ the branch $T_t$ is preferred to a single node $\{t\}$ according to cost-complexity. For some $\alpha$ both (20.45) and (20.46) will become equal. This critical value of $\alpha$ can be determined from:

$$
E_\alpha(T_t) < E_\alpha(\{t\}) \tag{20.47}
$$

which is equivalent to

$$
\alpha < \frac{E(t) - E(T_t)}{|\tilde{T}_t| - 1} \tag{20.48}
$$

where $\alpha > 0$ since $E(t) > E(T_t)$.

To obtain the next member of the subtree sequence, i.e. $T_2$ out of $T_1$, a special node called the weak link is determined. For this purpose a function $g_1(t)$, $t \in T_1$, is defined as

$$
g_1(t) =
\begin{cases}
\dfrac{E(t) - E(T_t)}{|\tilde{T}_t| - 1}, & t \notin \tilde{T}_1 \\
+\infty, & t \in \tilde{T}_1
\end{cases}
\tag{20.49}
$$

Node $\bar{t}_1$ is a weak link in $T_1$ if

$$
g_1(\bar{t}_1) = \min_{t \in T_1} g_1(t) \tag{20.50}
$$

and a new value for $\alpha_2$ is defined as

$$
\alpha_2 = g_1(\bar{t}_1) \tag{20.51}
$$

A new tree $T_2 < T_1$ in the sequence is obviously defined by pruning the branch $T_{\bar{t}_1}$, i.e.

$$
T_2 = T_1 - T_{\bar{t}_1} \tag{20.52}
$$

The process is looped until the root node $\{t_0\}$, the final member of the sequence, is reached. When there are multiple weak links detected, for instance $g_k(\bar{t}_k) = g_k(\bar{t}_k')$, then both branches are pruned, i.e. $T_{k+1} = T_k - T_{\bar{t}_k} - T_{\bar{t}_k'}$.

In this way it is possible to get the sequence of optimal subtrees $T_{\mathrm{Max}} \succ T_1 \succ
T_2 \succ T_3 \succ \cdots \succ \{t_0\}$, for which it is possible to prove that the sequence $\{\alpha_k\}$ is
increasing, i.e. $\alpha_k < \alpha_{k+1}$, $k \ge 1$, and $\alpha_1 = 0$. For $k \ge 1$: $\alpha_k \le \alpha < \alpha_{k+1}$ and
$T(\alpha) = T(\alpha_k) = T_k$.

Practically this tells us how to implement the search algorithm. First, the
maximum tree $T_{\mathrm{Max}}$ is taken, then $T_1$ is found, a weak link $\bar{t}_1$ is detected and the
branch $T_{\bar{t}_1}$ is pruned off, $\alpha_2$ is calculated and the process is continued.
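The search loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the book's quantlet: the toy tree, its node names and its error values are hypothetical, and the tree is stored as a dictionary mapping each node to its error estimate and its two children.

```python
import math

# Hypothetical toy tree: node -> (E(t), left child, right child).
# Terminal nodes have no children.
TREE = {
    "t0": (0.5, "t1", "t2"),
    "t1": (0.1875, "t3", "t4"),
    "t2": (0.125, None, None),
    "t3": (0.0625, None, None),
    "t4": (0.0625, None, None),
}


def nodes(tree, t):
    """All nodes of the branch rooted at t (t and its descendants)."""
    _, left, right = tree[t]
    if left is None:
        return [t]
    return [t] + nodes(tree, left) + nodes(tree, right)


def leaves(tree, t):
    """Terminal nodes of the branch rooted at t."""
    return [u for u in nodes(tree, t) if tree[u][1] is None]


def g(tree, t):
    """Error decrease per extra terminal node; infinite for terminal nodes."""
    if tree[t][1] is None:
        return math.inf
    terminal = leaves(tree, t)
    branch_error = sum(tree[u][0] for u in terminal)
    return (tree[t][0] - branch_error) / (len(terminal) - 1)


def prune_sequence(tree, root):
    """Return (alpha, terminal nodes) for each tree of the pruning sequence."""
    tree = dict(tree)  # work on a copy
    sequence = [(0.0, sorted(leaves(tree, root)))]
    while tree[root][1] is not None:
        internal = [t for t in nodes(tree, root) if tree[t][1] is not None]
        scores = {t: g(tree, t) for t in internal}
        alpha = min(scores.values())
        for t in internal:
            if math.isclose(scores[t], alpha):
                # Collapse every weak link into a terminal node.
                tree[t] = (tree[t][0], None, None)
        sequence.append((alpha, sorted(leaves(tree, root))))
    return sequence


SEQUENCE = prune_sequence(TREE, "t0")
```

For the toy tree the weak link is first `t1` and then the root, and the recorded values of the complexity parameter increase from step to step, as the text states for the general case.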

When the algorithm is applied to $T_1$, the number of pruned nodes is usually
quite significant. For instance, consider the following typical empirical evidence
(see Table 20.6). When the trees become smaller, the difference in the number of
terminal nodes also gets smaller.

Finally, it is worth mentioning that the sequence of optimally pruned subtrees
is a subset of the trees which might be constructed by directly minimising the
internal misclassification estimator for a fixed number of terminal nodes.


Table 20.6 Typical pruning speed

Tree               T1   T2   T3   T4   T5   T6   T7   T8   T9   T10  T11  T12  T13
$|\tilde{T}_k|$    71   63   58   40   34   19   10   9    7    6    5    2    1
Consider an example of a tree $T(\alpha)$ with 7 terminal nodes. Then there is no other
subtree $T$ with 7 terminal nodes having lower $E(T)$. Otherwise

$$E_\alpha(T) = E(T) + 7\alpha < E_\alpha\{T(\alpha)\} = \min_{T \preceq T_{\mathrm{Max}}} E_\alpha(T)$$

which is impossible by definition.

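Applying the method of V-fold cross-validation to the sequence $T_{\mathrm{Max}} \succ T_1 \succ
T_2 \succ T_3 \succ \cdots \succ \{t_0\}$, an optimal tree is determined. On the other hand it is
frequently pointed out that the choice of the tree with the minimum value of $\hat{E}^{CV}(T)$ is not
always adequate since $\hat{E}^{CV}(T)$ is not too robust, i.e. there is a whole range of values
$\hat{E}^{CV}(T)$ satisfying $\hat{E}^{CV}(T) < \hat{E}^{CV}_{\min}(T) + \varepsilon$ for small $\varepsilon > 0$. Moreover, when
$V < N$ a simple change of the random generator seed will generally change the
value of $|\tilde{T}_k|$ minimising $\hat{E}(T_k)$. Hence a so-called one standard error empirical
rule is applied, which states that if $T_{k_0}$ is the tree minimising $\hat{E}^{CV}(T_{k_0})$ from the
sequence $T_{\mathrm{Max}} \succ T_1 \succ T_2 \succ T_3 \succ \cdots \succ \{t_0\}$, then a value $k_1$ and a corresponding
tree $T_{k_1}$ are selected so that

$$\underset{k_1}{\operatorname{argmax}} \; \hat{E}(T_{k_1}) \le \hat{E}(T_{k_0}) + \hat{\sigma}\{\hat{E}(T_{k_0})\} \qquad (20.53)$$

that is, $k_1$ is the largest index (the smallest tree) whose estimated error stays within
one standard error of the minimum. Here $\hat{\sigma}(\cdot)$ denotes the sample estimate of the standard
error and $\hat{E}(\cdot)$ the relevant sample estimators.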
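A minimal sketch of the one standard error rule follows. The function name, the list layout (trees ordered from largest to smallest, zero-based positions) and all numbers are hypothetical and serve only as an illustration.

```python
def one_se_rule(cv_errors, std_errors):
    """Return (k0, k1): position of the minimum-error tree and of the
    smallest tree whose error is within one standard error of the minimum.

    Both lists are ordered like the pruning sequence, largest tree first.
    """
    k0 = min(range(len(cv_errors)), key=cv_errors.__getitem__)
    threshold = cv_errors[k0] + std_errors[k0]
    k1 = max(k for k in range(len(cv_errors)) if cv_errors[k] <= threshold)
    return k0, k1


# Hypothetical cross-validation results for six trees of decreasing size.
sizes = [40, 25, 16, 9, 4, 1]
cv_errors = [0.30, 0.26, 0.25, 0.27, 0.33, 0.50]
std_errors = [0.03, 0.03, 0.03, 0.03, 0.03, 0.03]

k0, k1 = one_se_rule(cv_errors, std_errors)
chosen_size = sizes[k1]
```

Here the minimum error is reached by the tree with 16 terminal nodes, while the rule selects the tree with 9 terminal nodes, trading a small increase in estimated error for lower complexity.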

The dotted line in Fig. 20.29 shows the area where the values of $\hat{E}(T_k)$ only
slightly differ from $\min_{|\tilde{T}_k|} \hat{E}(T_k)$. The left edge, which is roughly equivalent to 16
terminal nodes, shows the application of the one standard error rule. The use of the one
standard error rule allows not only to achieve more robust results but also to get
trees of lower complexity given an error comparable with $\min_{|\tilde{T}_k|} \hat{E}(T_k)$.

Fig. 20.29 The example of relationship between $\hat{E}(T_k)$ and number of terminal nodes

Regression  Trees 

Up to now we have concentrated on classification trees. Although regression trees
share a similar logical framework, there are some differences which need to be
addressed. The important difference between classification and regression trees is
the type of the dependent variable $Y$. When $Y$ is discrete, a decision tree is called a
classification tree; a regression tree is a decision tree with a continuous dependent
variable.

The Gini index and the twoing rule discussed in previous sections assume that the
number of classes is finite and hence introduce some measures based mainly
on $p(j \mid t)$ for an arbitrary class $j$ and node $t$. But since in the case of a continuous
dependent variable there are no more classes, this approach cannot be used
anymore unless groups of continuous values are effectively substituted with artificial
classes. Since there are no classes anymore, how can the maximum regression
tree be determined? Analogously to the discrete case, absolute homogeneity can be
described only after some adequate impurity measure for regression trees is
introduced.

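Recalling the idea of the Gini index, it becomes quite natural to use the variance as an
impurity indicator. Since the variance of the data in each node can be easily computed, the
splitting criterion for an arbitrary node $t$ can be written as

$$s^* = \operatorname*{argmin}_{s} \left[ p_L \operatorname{var}\{t_L(s)\} + p_R \operatorname{var}\{t_R(s)\} \right] \qquad (20.54)$$

where $t_L$ and $t_R$ are the emerging child nodes, which depend directly on the choice
of the split $s$. The weighted variance of the child nodes is minimised because a lower
variance within the child nodes means a larger decrease in impurity.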
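A minimal sketch of a variance-based split search for one explanatory variable is given below. It follows the usual CART convention of choosing the threshold with the lowest weighted variance of the two child nodes. The function names and the data are hypothetical.

```python
def variance(values):
    """Variance with divisor n."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(x, y):
    """Return (threshold, weighted child variance) of the best split on x,
    or None if x takes a single value."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold separates equal x values
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        score = len(left) / n * variance(left) + len(right) / n * variance(right)
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical data with two clearly separated response levels.
x = [1, 2, 3, 10, 11, 12]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
split = best_split(x, y)
```

For these data the search returns the threshold 6.5 with a weighted child variance of zero, since both child nodes are perfectly homogeneous.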

Hence the maximum regression tree can be easily defined as a structure where
each terminal node contains only observations with the same value of $Y$. It is important
to point out that since continuous data are much more likely to take different values than
discrete data, the maximum regression tree is usually very big.

Once the maximum regression tree is properly defined, obtaining an optimally sized
tree is no problem. As with classification trees, the maximum regression tree
is usually pruned upwards with the help of the cost-complexity function
and cross-validation. That is why the majority of the results presented above apply
to regression trees as well.
Fig. 20.30 Decision tree for bankruptcy dataset: Gini index, N = 30. Q MVACARTBan1


Bankruptcy  Analysis 


This  section  provides  a  practical  study  on  bankruptcy  data  involving  decision  trees. 
A  dataset  with  84  observations  representing  different  companies  is  constituted  by 
three  variables: 

-  net  income  to  total  assets  ratio 

-  total  liabilities  to  total  assets  ratio 

-  company status (-1 if bankrupt and 1 if not)

The  data  is  from  SEC  (2004). 

The goal is to predict and describe the company status given the two primary
financial ratios. Since no additional information like the functional form of a possible
relationship is available, the use of a classification tree is an attractive alternative.

The tree given in Fig. 20.30 was constructed using the Gini index and an N = 30
constraint, i.e. the number of points in each of the terminal nodes cannot be
more than 30. Numbers in parentheses displayed on terminal nodes are the numbers of
observations belonging to Class 1 and Class -1.

If we change the constraint to N = 10, the decision rule changes, see Fig. 20.31.
How exactly did the situation change? Consider the Class 1 terminal nodes of the
tree in Fig. 20.30. The first one contains 21 observations and thus was split for
N = 10. When it was split, two new nodes of different classes emerged, and for both
of them the impurity measure has decreased.

We may conclude that N = 10 is a good choice, and analysing the tree produced
we can state that for this particular example the net income to total assets ($X_1$)
ratio appears to be an important class indicator. The successful classification ratio
dynamic over the number of terminal nodes is shown in Fig. 20.32. The tree size is
chosen by the cross-validation method.

Fig. 20.31 Decision tree for bankruptcy dataset: Gini index, N = 10. Q MVACARTBan2

Fig. 20.32 Successful classification ratio dynamic over the number of terminal nodes: cross-validation (plot title: Classification ratio by minsize parameter; horizontal axis: MinSize parameter). Q MVAbancrupcydis

For this example with a relatively small sample size we construct two maximum
trees, using the Gini and twoing rules, see Figs. 20.33 and 20.34. Looking at both
decision trees we see that the choice of impurity measure is not as important as the
right choice of tree size.

Fig. 20.33 Maximum tree constructed employing Gini index. Q MVACARTGiniTree1
Summary

-  CART is a tree based method splitting the data sequentially into a binary tree.

-  CART determines the nodes by minimising an impurity measure at each node.

-  CART is non-parametric: when no data structure hypotheses are available, non-parametric analysis becomes the single effective data mining tool. CART is a flexible non-parametric data mining tool.

-  CART does not require variables to be selected in advance: from a learning sample CART will automatically select the most significant ones.

-  CART is very efficient in computational terms: although all possible data splits are analysed, the CART architecture is flexible enough to do all of them quickly.

-  CART is robust to the effect of outliers: due to the data-splitting nature of decision rule creation it is possible to distinguish between datasets with different characteristics and hence to neutralise outliers in separate nodes.

-  CART can use any combination of continuous and categorical data: researchers are no longer limited to a particular class of data and will be able to capture more real-life examples.


20.6  Boston  Housing 

Coming back to the Boston Housing data set, we compare the results of EPP on the
original data $X$ and the transformed data $\tilde{X}$ motivated in Sect. 1.9. We exclude
$X_4$ (indicator of Charles River) from the present analysis.

The aim of this analysis is to see from a different angle whether our proposed
transformations yield more normal distributions and whether they yield data with
fewer outliers. Both effects will be visible in our projection pursuit analysis.

We first apply the Jones and Sibson index to the non-transformed data with 50
randomly chosen 13-dimensional directions. Figure 20.35 displays the results in the
following form. In the lower part, we see the values of the Jones and Sibson index.
It should be constant for 13-dimensional normal data. We observe that this is clearly
not the case. In the upper part of Fig. 20.35 we show the standard normal density as
a green curve and two densities corresponding to two extreme index values. The red,
slim curve corresponds to the maximal value of the index among the 50 projections.
The blue curve, which is close to the normal, corresponds to the minimal value of
the Jones and Sibson index. The corresponding values of the indices have the same
colour in the lower part of Fig. 20.35. Below the densities, a jitter plot shows the
distribution of the projected points $a^\top x_i$ ($i = 1, \ldots, 506$). We conclude from the
outlying projection in the red distribution that several points are in conflict with the
normality assumption.

Fig. 20.35 Projection Pursuit with the Sibson-Jones index with 13 original variables (panel title: 50 directions). Q MVAppsib

Fig. 20.36 Projection Pursuit with the Sibson-Jones index with 13 transformed variables (panel title: 50 directions). Q MVAppsib

Figure 20.36 presents an analysis with the same design for the transformed data.
In the lower part of the figure we observe values of the Jones and Sibson index that
are much lower (by a factor of 10) and less variable, which suggests that the
transformed data is closer to the normal. (“Closeness” is interpreted here in the sense
of the Jones and Sibson index.) This is confirmed by looking at the upper part of
Fig. 20.36, which has a significantly less outlying structure than Fig. 20.35.
20.7 Exercises

Exercise  20.1  Calculate  the  Simplicial  Depth  for  the  Swiss  bank  notes  data  set  and 
compare  the  results  to  the  univariate  medians.  Calculate  the  Simplicial  Depth  again 
for  the  genuine  and  counterfeit  bank  notes  separately. 

Exercise 20.2 Construct a configuration of points in $\mathbb{R}^2$ such that $x_{\mathrm{med},j}$
from (20.2) is not in the “centre” of the scatterplot.

Exercise 20.3 Apply the SIR technique to the US companies data with $Y$ =
market value and $X$ = all other variables. Which directions do you find?

Exercise 20.4 Simulate a data set with $X \sim N_4(0, \mathcal{I}_4)$, $Y = (X_1 + 3X_2)^2 + (X_3 -
X_4)^4 + \varepsilon$ and $\varepsilon \sim N(0, (0.1)^2)$. Use SIR and SIR II to find the EDR directions.

Exercise  20.5  Apply  the  Projection  Pursuit  technique  on  the  Swiss  bank  notes  data 
set  and  compare  the  results  to  the  PC  analysis  and  the  Fisher  discriminant  rule. 

Exercise 20.6 Apply the SIR and SIR II technique on the car data set in Table 22.3
with $Y$ = price.

Exercise 20.7 Generate four regions on the two-dimensional unit square by
sequentially cutting parallel to the coordinate axes. Generate 100 two-dimensional
Uniform random variables and label them according to their presence in the above
regions. Apply the CART algorithm to find the region boundaries and to classify the
observations.

Exercise  20.8  Modify  Exercise  20.7  by  defining  the  regions  as  lying  above  and 
below  the  main  diagonal  of  the  unit  square.  Make  a  CART  analysis  and  comment 
on  the  complexity  of  the  tree. 

Exercise 20.9 Apply the SVM with different radial basis parameter $r$ and different
capacity parameter $c$ in order to separate two circular datasets. This example
is often called the Orange Peel exercise and involves two Normal distributions
$N(\mu_i, \Sigma_i)$, $i = 1, 2$, with covariance matrices $\Sigma_1 = 2\mathcal{I}_2$ and $\Sigma_2 = 0.5\mathcal{I}_2$.

Exercise 20.10 The noisy spiral data set consists of two intertwining spirals that
need to be separated by a non-linear classification method. Apply the SVM with
different radial basis parameter $r$ and capacity parameter $c$ in order to separate the
two spiral datasets.

Exercise 20.11 Apply the SVM to separate the bankrupt from the surviving (profitable)
companies using the profitability and leverage ratios given in the Bankruptcy
data set in Table 22.21.
Chapter 21

Symbols and Notations


Basics

$X, Y$                          Random variables or vectors
$X_1, X_2, \ldots, X_p$         Random variables
$X = (X_1, \ldots, X_p)^\top$   Random vector
$X \sim \cdot$                  $X$ has distribution $\cdot$
$\mathcal{A}, \mathcal{B}$      Matrices
$\Gamma, \Delta$                Matrices
$\mathcal{X}, \mathcal{Y}$      Data matrices
$\Sigma$                        Covariance matrix
$1_n$                           Vector of ones $(1, \ldots, 1)^\top$ ($n$ times)
$0_n$                           Vector of zeros $(0, \ldots, 0)^\top$ ($n$ times)
$I(\cdot)$                      Indicator function, i.e. for a set $M$ is $I = 1$ on $M$, $I = 0$ otherwise
$i$                             $\sqrt{-1}$
$\Rightarrow$                   Implication
$\Leftrightarrow$               Equivalence
$\approx$                       Approximately equal
$\otimes$                       Kronecker product
iff                             if and only if, equivalence
Mathematical Abbreviations

$\operatorname{tr}(\mathcal{A})$           Trace of matrix $\mathcal{A}$
$\operatorname{hull}(x_1, \ldots, x_k)$    Convex hull of points $\{x_1, \ldots, x_k\}$
$\operatorname{diag}(\mathcal{A})$         Diagonal of matrix $\mathcal{A}$
$\operatorname{rank}(\mathcal{A})$         Rank of matrix $\mathcal{A}$
$\det(\mathcal{A})$                        Determinant of matrix $\mathcal{A}$
$C(\mathcal{A})$                           Column space of matrix $\mathcal{A}$


Samples

$x, y$                                      Observations of $X$ and $Y$
$x_1, \ldots, x_n = \{x_i\}_{i=1}^n$        Sample of $n$ observations of $X$
$\mathcal{X}$                               ($n \times p$) data matrix of observations of $X_1, \ldots, X_p$ or of $X = (X_1, \ldots, X_p)^\top$
$x_{(1)}, \ldots, x_{(n)}$                  The order statistic of $x_1, \ldots, x_n$
$\mathcal{H}$                               Centering matrix, $\mathcal{H} = \mathcal{I}_n - n^{-1} 1_n 1_n^\top$

Densities and Distribution Functions

$f(x)$                                      Density of $X$
$f(x, y)$                                   Joint density of $X$ and $Y$
$f_X(x), f_Y(y)$                            Marginal densities of $X$ and $Y$
$f_{X_1}(x_1), \ldots, f_{X_p}(x_p)$        Marginal densities of $X_1, \ldots, X_p$
$\hat{f}_h(x)$                              Histogram or kernel estimator of $f(x)$
$F(x)$                                      Distribution function of $X$
$F(x, y)$                                   Joint distribution function of $X$ and $Y$
$F_X(x), F_Y(y)$                            Marginal distribution functions of $X$ and $Y$
$F_{X_1}(x_1), \ldots, F_{X_p}(x_p)$        Marginal distribution functions of $X_1, \ldots, X_p$
$\varphi(x)$                                Density of the standard normal distribution
$\Phi(x)$                                   Standard normal distribution function
$\varphi_X(t)$                              Characteristic function of $X$
$m_k$                                       $k$-th moment of $X$
$\kappa_j$                                  Cumulants or semi-invariants of $X$
Moments

$E X, E Y$                                  Mean values of random variables or vectors $X$ and $Y$
$\sigma_{XY} = \operatorname{Cov}(X, Y)$    Covariance between random variables $X$ and $Y$
$\sigma_{XX} = \operatorname{Var}(X)$       Variance of random variable $X$
$\rho_{XY} = \dfrac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X) \operatorname{Var}(Y)}}$    Correlation between random variables $X$ and $Y$
$\Sigma_{XY} = \operatorname{Cov}(X, Y)$    Covariance between random vectors $X$ and $Y$, i.e. $\operatorname{Cov}(X, Y) = E(X - EX)(Y - EY)^\top$
$\Sigma_{XX} = \operatorname{Var}(X)$       Covariance matrix of the random vector $X$


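Empirical Moments

$\bar{x} = \dfrac{1}{n} \sum_{i=1}^n x_i$                                   Average of $X$ sampled by $\{x_i\}_{i=1,\ldots,n}$
$s_{XY} = \dfrac{1}{n} \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})$         Empirical covariance of random variables $X$ and $Y$ sampled by $\{x_i\}_{i=1,\ldots,n}$ and $\{y_i\}_{i=1,\ldots,n}$
$s_{XX} = \dfrac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2$                      Empirical variance of random variable $X$ sampled by $\{x_i\}_{i=1,\ldots,n}$
$r_{XY} = \dfrac{s_{XY}}{\sqrt{s_{XX} s_{YY}}}$                             Empirical correlation of $X$ and $Y$
$\mathcal{S} = \{s_{X_i X_j}\} = n^{-1} \mathcal{X}^\top \mathcal{H} \mathcal{X}$    Empirical covariance matrix of $X_1, \ldots, X_p$ or of the random vector $X = (X_1, \ldots, X_p)^\top$
$\mathcal{R} = \{r_{X_i X_j}\} = \mathcal{D}^{-1/2} \mathcal{S} \mathcal{D}^{-1/2}$  Empirical correlation matrix of $X_1, \ldots, X_p$ or of the random vector $X = (X_1, \ldots, X_p)^\top$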
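The matrix form of the empirical covariance can be checked against the elementwise definition. The sketch below is an illustration with hypothetical helper names and data. It builds the centering matrix explicitly, forms the product of the transposed data matrix, the centering matrix and the data matrix, divides by n, and compares the result with the covariances computed from deviations about the column means.

```python
def centering_matrix(n):
    """Identity matrix minus the n x n matrix with all entries 1/n."""
    return [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)] for i in range(n)]


def transpose(a):
    return [list(row) for row in zip(*a)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def cov_via_centering(x):
    """Empirical covariance matrix computed in matrix form (divisor n)."""
    n = len(x)
    product = matmul(matmul(transpose(x), centering_matrix(n)), x)
    return [[v / n for v in row] for row in product]


def cov_direct(x):
    """Empirical covariance matrix from deviations about the column means."""
    n, p = len(x), len(x[0])
    means = [sum(row[j] for row in x) / n for j in range(p)]
    return [[sum((row[j] - means[j]) * (row[k] - means[k]) for row in x) / n
             for k in range(p)] for j in range(p)]


# Hypothetical (3 x 2) data matrix.
data = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
s_matrix = cov_via_centering(data)
s_direct = cov_direct(data)
```

Both routes give the same matrix up to rounding, which is the content of the matrix identity in the table above.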

Distributions

$\varphi(x)$                        Density of the standard normal distribution
$\Phi(x)$                           Distribution function of the standard normal distribution
$N(0, 1)$                           Standard normal or Gaussian distribution
$N(\mu, \sigma^2)$                  Normal distribution with mean $\mu$ and variance $\sigma^2$
$N_p(\mu, \Sigma)$                  $p$-dimensional normal distribution with mean $\mu$ and covariance matrix $\Sigma$
$\xrightarrow{\mathcal{L}}$         Convergence in distribution
CLT                                 Central Limit Theorem
$\chi^2_p$                          $\chi^2$ distribution with $p$ degrees of freedom
$\chi^2_{1-\alpha;p}$               $1 - \alpha$ quantile of the $\chi^2$ distribution with $p$ degrees of freedom
$t_n$                               $t$-distribution with $n$ degrees of freedom
$t_{1-\alpha/2;n}$                  $1 - \alpha/2$ quantile of the $t$-distribution with $n$ d.f.
$F_{n,m}$                           $F$-distribution with $n$ and $m$ degrees of freedom
$F_{1-\alpha;n,m}$                  $1 - \alpha$ quantile of the $F$-distribution with $n$ and $m$ degrees of freedom
$T^2_{p,n}$                         Hotelling $T^2$-distribution with $p$ and $n$ degrees of freedom
Chapter 22

Data 


All  data  sets  are  available  on  the  Springer  webpage  or  at  the  authors’  home  pages. 


22.1  Boston  Housing  Data 

The Boston housing data set was collected by Harrison and Rubinfeld (1978). It
comprises 506 observations, one for each census district of the Boston metropolitan area.
The data set was analysed in Belsley, Kuh, and Welsch (1980).

X1: Per capita crime rate,

X2: Proportion of residential land zoned for large lots,

X3: Proportion of nonretail business acres,

X4: Charles River (1 if tract bounds river, 0 otherwise),

X5: Nitric oxides concentration,

X6: Average number of rooms per dwelling,

X7: Proportion of owner-occupied units built prior to 1940,

X8: Weighted distances to five Boston employment centers,

X9: Index of accessibility to radial highways,

X10: Full-value property tax rate per $10,000,

X11: Pupil/teacher ratio,

X12: 1000(B - 0.63)^2 I(B < 0.63) where B is the proportion of African American,

X13: % lower status of the population,

X14: Median value of owner-occupied homes in $1,000.

22.2  Swiss  Bank  Notes 

Six  variables  measured  on  100  genuine  and  100  counterfeit  old  Swiss  1000-franc 
bank  notes.  The  data  stem  from  Flury  and  Riedwyl  (1988).  The  columns  correspond 
to  the  following  six  variables. 


X1: Length of the bank note,

X2: Height of the bank note, measured on the left,

X3: Height of the bank note, measured on the right,

X4: Distance of inner frame to the lower border,

X5: Distance of inner frame to the upper border,

X6: Length of the diagonal.


Observations  1-100  are  the  genuine  bank  notes  and  the  other  100  observations 
are  the  counterfeit  bank  notes. 


22.3  Car  Data 


The car data set (Chambers, Cleveland, Kleiner & Tukey, 1983) consists of 13
variables measured for 74 car types. The abbreviations in this section are as follows:

X1:   P     Price,
X2:   M     Mileage (in miles per gallon),
X3:   R78   Repair record 1978 (rated on a 5-point scale; 5 best, 1 worst),
X4:   R77   Repair record 1977 (scale as before),
X5:   H     Headroom (in inches),
X6:   R     Rear seat clearance (distance from front seat back to rear seat, in inches),
X7:   Tr    Trunk space (in cubic feet),
X8:   W     Weight (in pounds),
X9:   L     Length (in inches),
X10:  T     Turning diameter (clearance required to make a U-turn, in feet),
X11:  D     Displacement (in cubic inches),
X12:  G     Gear ratio for high gear,
X13:  C     Company headquarters (1 for USA, 2 for Japan, 3 for Europe).


22.4 Classic Blue Pullovers Data

This  is  a  data  set  consisting  of  ten  measurements  of  four  variables.  The  story: 
A  textile  shop  manager  is  studying  the  sales  of  “classic  blue”  pullovers  over  ten 
periods.  He  uses  three  different  marketing  methods  and  hopes  to  understand  his 
sales  as  a  fit  of  these  variables  using  statistics.  The  variables  measured  are 


X1: Numbers of sold pullovers,

X2: Price (in EUR),

X3: Advertisement costs in local newspapers (in EUR),

X4: Presence of a sales assistant (in hours per period).


22.5  US  Companies  Data 

The  data  set  consists  of  measurements  for  79  US  companies.  The  abbreviations  in 
this  section  are  as  follows: 


X1:  A    Assets (USD),
X2:  S    Sales (USD),
X3:  MV   Market value (USD),
X4:  P    Profits (USD),
X5:  CF   Cash flow (USD),
X6:  E    Employees.

22.6  French  Food  Data 

The  data  set  consists  of  the  average  expenditures  on  food  for  several  different  types 
of  families  in  France  (manual  workers  =  MA,  employees  =  EM,  managers  =  CA) 
with  different  numbers  of  children  (2,  3,  4  or  5  children).  The  data  is  taken  from 
Lebart,  Morineau,  and  Fenelon  (1982). 
22.7 Car Marks

The  data  are  averaged  marks  for  23  car  types  from  a  sample  of  40  persons.  The 
marks  range  from  1  (very  good)  to  6  (very  bad)  like  German  school  marks.  The 
variables  are: 

X1: A Economy,

X2: B Service,

X3: C Non-depreciation of value,

X4: D Price, Mark 1 for very cheap cars,

X5: E Design,

X6: F Sporty car,

X7: G Safety,

X8: H Easy handling.


22.8  French  Baccalaureat  Frequencies 

The  data  consist  of  observations  of  202,100  baccalaureats  from  France  in  1976  and 
give  the  frequencies  for  different  sets  of  modalities  classified  into  regions.  For  a 
reference  see  Bouroche  and  Saporta  (1980).  The  variables  (modalities)  are: 

X1: A Philosophy-Letters,

X2: B Economics and Social Sciences,

X3: C Mathematics and Physics,

X4: D Mathematics and Natural Sciences,

X5: E Mathematics and Techniques,

X6: F Industrial Techniques,

X7: G Economic Techniques,

X8: H Computer Techniques.


22.9  Journaux  Data 

This is a data set that was created from a survey completed in the 1980s in
Belgium questioning people’s reading habits. They were asked where they live (10
regions comprised of 7 provinces and 3 regions around Brussels) and what kind
of newspaper they read on a regular basis. The 15 possible answers belong to 3
classes: Flemish newspapers (first letter v), French newspapers (first letter f) and
both languages (first letter b).

X1:   WaBr   Walloon Brabant
X2:   Brar   Brussels area
X3:   Antw   Antwerp
X4:   FlBr   Flemish Brabant
X5:   OcFl   Occidental Flanders
X6:   OrFl   Oriental Flanders
X7:   Hain   Hainaut
X8:   Lieg   Liege
X9:   Limb   Limburg
X10:  Luxe   Luxembourg

22.10  US  Crime  Data 

This  is  a  data  set  consisting  of  50  measurements  of  7  variables.  It  states  for  1  year 
(1985)  the  reported  number  of  crimes  in  the  50  states  of  the  US  classified  according 
to  7  categories  (X3-X9). 

X1: Land area (land)

X2: Population 1985 (popu 1985)

X3: Murder (murd)

X4: Rape

X5: Robbery (robb)

X6: Assault (assa)

X7: Burglary (burg)

X8: Larceny (larc)

X9: Auto theft (auto)

X10: US states region number (reg)

X11: US states division number (div)
Division numbers          Region numbers

New England     1         Northeast   1
Mid Atlantic    2         Midwest     2
E N Central     3         South       3
W N Central     4         West        4
S Atlantic      5
E S Central     6
W S Central     7
Mountain        8
Pacific         9

22.11 Plasma Data

In Olkin and Veath (1980), the evolution of citrate concentration in the plasma is
observed at three different times of day, X1 (8 am), X2 (11 am) and X3 (3 pm), for
two groups of patients. Each group follows a different diet.

X1:  8 am
X2:  11 am
X3:  3 pm

22.12  WAIS  Data 

Morrison (1990) compares the results of four subtests of the Wechsler Adult
Intelligence Scale (WAIS) for two categories of people: in group one are $n_1 = 37$
people who do not present a senile factor, group two are those ($n_2 = 12$) presenting
a senile factor.

WAIS subtests:

X1:  Information
X2:  Similarities
X3:  Arithmetic
X4:  Picture completion
22.13 ANOVA Data

The yields of wheat have been measured in 30 parcels which have been randomly
attributed to 3 lots prepared by one of 3 different fertilisers A, B and C.

X1:  Fertiliser A
X2:  Fertiliser B
X3:  Fertiliser C

22.14  Timebudget  Data 


In Volle (1985), we can find data on 28 individuals identified according to
sex, country where they live, professional activity and matrimonial status, which
indicates the amount of time each person spent on ten categories of activities over
100 days (100 · 24 h = 2,400 h total in each row) in the year 1976.

X1:   prof   Professional activity
X2:   tran   Transportation linked to professional activity
X3:   hous   Household occupation
X4:   kids   Occupation linked to children
X5:   shop   Shopping
X6:   pers   Time spent for personal care
X7:   eat    Eating
X8:   slee   Sleeping
X9:   tele   Watching television
X10:  leis   Other leisures

maus:  Active men in the USA
waus:  Active women in the USA
wnus:  Nonactive women in the USA
mmus:  Married men in USA
wmus:  Married women in USA
msus:  Single men in USA
wsus:  Single women in USA
mawe:  Active men from Western countries
wawe:  Active women from Western countries
wnwe:  Nonactive women from Western countries
mmwe:  Married men from Western countries
wmwe:  Married women from Western countries
mswe:  Single men from Western countries
wswe:  Single women from Western countries
mayo:  Active men from Yugoslavia
wayo:  Active women from Yugoslavia
wnyo:  Nonactive women from Yugoslavia
mmyo:  Married men from Yugoslavia
wmyo:  Married women from Yugoslavia
msyo:  Single men from Yugoslavia
wsyo:  Single women from Yugoslavia
maes:  Active men from Eastern countries
waes:  Active women from Eastern countries
wnes:  Nonactive women from Eastern countries
mmes:  Married men from Eastern countries
wmes:  Married women from Eastern countries
mses:  Single men from Eastern countries
wses:  Single women from Eastern countries


22.15  Geopol  Data 

This  data  set  contains  a  comparison  of  41  countries  according  to  10  different 
political  and  economic  parameters. 

X1:   popu   Population
X2:   giph   Gross Internal Product per habitant
X3:   ripo   Rate of increase of the population
X4:   rupo   Rate of urban population
X5:   rlpo   Rate of illiteracy in the population
X6:   rspo   Rate of students in the population
X7:   eltp   Expected lifetime of people
X8:   mnr    Rate of nutritional needs realised
X9:   nunh   Number of newspapers and magazines per 1,000 habitants
X10:  nuth   Number of television per 1,000 habitants


AFS   South Africa     DAN   Denmark      MAR   Morocco
ALG   Algeria          EGY   Egypt        MEX   Mexico
BRD   Germany          ESP   Spain        NOR   Norway
GBR   Great Britain    FRA   France       PER   Peru
ARS   Saudi Arabia     GAB   Gabon        POL   Poland
ARG   Argentina        GRE   Greece       POR   Portugal
AUS   Australia        HOK   Hong Kong    SUE   Sweden
AUT   Austria          HON   Hungary      SUI   Switzerland
BEL   Belgium          IND   India        THA   Thailand
CAM   Cameroon         IDO   Indonesia    URS   USSR
CAN   Canada           ISR   Israel       USA   USA
CHL   Chile            ITA   Italy        VEN   Venezuela
CHN   China            JAP   Japan        YOU   Yugoslavia
CUB   Cuba             KEN   Kenya

22.16  US  Health  Data 

This  is  a  data  set  consisting  of  50  measurements  of  13  variables.  It  states  for  1  year 
(1985)  the  reported  number  of  deaths  in  the  50  states  of  the  US  classified  according 
to  7  categories. 

X1:  Land  area  (land)
X2:  Population  1985  (popu)
X3:  Accident  (acc)
X4:  Cardiovascular  (card)
X5:  Cancer  (canc)
X6:  Pulmonary  (pul)
X7:  Pneumonia  flu  (pnue)
X8:  Diabetes  (diab)
X9:  Liver  (liv)
X10:  Doctors  (doc)
X11:  Hospitals  (hosp)
X12:  US  states  region  number  (r)
X13:  US  states  division  number  (d)

Division numbers      Region numbers
New England    1      Northeast  1
Mid Atlantic   2      Midwest    2
E N Central    3      South      3
W N Central    4      West       4
S Atlantic     5
E S Central    6
W S Central    7
Mountain       8
Pacific        9

22.17  Vocabulary  Data 

This  example  of  the  evolution  of  the  vocabulary  of  children  can  be  found  in  Bock 
(1975).  Data  are  drawn  from  test  results  on  file  in  the  Records  Office  of  the 
Laboratory  School  of  the  University  of  Chicago.  They  consist  of  scores,  obtained 
from  a  cohort  of  pupils  from  the  eighth  through  eleventh  grade  levels,  on  alternative 
forms  of  the  vocabulary  section  of  the  Cooperative  Reading  Test.  The  data  set
gives  the  scaled  scores  for  the  sample  of  64  subjects  (the  origin  and  units  of  the
scale  are  fixed  arbitrarily).


22.18  Athletic  Records  Data 

This  data  set  provides  men’s  athletic  records  for  55  countries  at  the  1984
Olympic  Games.


22.19  Unemployment  Data 

This  data  set  provides  unemployment  rates  in  all  federal  states  of  Germany  in 
November  2005. 


22.20  Annual  Population  Data 

The  data  show  the  yearly  average  population  of  the  former  territory  of  the  Federal
Republic  of  Germany  incl.  Berlin-West  (given  in  1,000  inhabitants).


22.21  Bankruptcy  Data  I

The  data  are  the  profitability,  leverage,  and  bankruptcy  indicators  for  84  companies. 

The  data  set  contains  information  on  42  of  the  largest  companies  that  filed  for
protection  against  creditors  under  Chapter  11  of  the  US  Bankruptcy  Code  in
2001-2002  after  the  stock  market  crash  of  2000.  The  bankrupt  companies  were  matched
with  42  surviving  companies  with  the  closest  capitalisations  and  the  same  US
industry  classification  codes  available  through  the  Division  of  Corporation  Finance
of  the  Securities  and  Exchange  Commission  (SEC,  2004).

The  information  for  each  company  was  collected  from  the  annual  reports  for
1998-1999  (SEC,  2004),  i.e.  3  years  prior  to  the  defaults  of  the  bankrupt  companies.
The  following  data  set  contains  profitability  and  leverage  ratios  calculated,
respectively,  as  the  ratio  of  net  income  (NI)  to  total  assets  (TA)  and  the  ratio  of
total  liabilities  (TL)  to  total  assets  (TA).


22.22  Bankruptcy  Data  II 

Altman  (1968),  quoted  by  Morrison  (1990),  reports  financial  data  on  66  banks. 

X1  =  (Working  capital)/(total  assets)
X2  =  (Retained  earnings)/(total  assets)
X3  =  (Earnings  before  interest  and  taxes)/(total  assets)
X4  =  (Market  value  equity)/(book  value  of  total  liabilities)
X5  =  (Sales)/(total  assets)

The  first  33  observations  correspond  to  bankrupt  banks  and  the  last  33  to  solvent
banks,  as  indicated  by  the  values  of  y  in  the  last  column.
Multivariate  generalised  hyperbolic  distribution,  160
Multivariate  Laplace  distribution,  163
Multivariate  median,  504 
Multivariate  t-distribution,  163,  196 

Nearest  neighbor,  395 

Non-metric  solution,  479 

Nonexistence  of  a  riskless  asset,  491 

Nonhomogeneous,  92 

Nonmetric  methods  of  MDS,  459 

Norm  of  a  vector,  72 

Normal  distribution,  203 

Normal-inverse  Gaussian,  151 

Normalized  principal  components  (NPCs),  335 

Null  space,  74 

Odds,  273 
Order  statistics,  5 
Orthogonal  complement,  75 
Orthogonal  matrix,  54 


Orthonormed,  309 
Outliers,  3 
Outside  bars,  7 

Parallel  profiles,  238 

Partitioned  covariance  matrix,  184 

Partitioned  matrices,  66

PAV  algorithm,  466,  484 

Pearson  chi-square,  269 

Pearson  chi-square  test  for  independence,  269 

Pool-adjacent  violators  algorithm,  466,  484 

Portfolio
    analysis,  487
    choice,  487
Positive
    definite,  62
    definiteness,  65
    or  negative  dependence,  22
    semidefinite,  62,  90
Principal
    axes,  70
    components,  324
        method,  372
        in  practice,  324
        technique,  324
        transformation,  321,  323
    factors,  370
Principal  components  analysis  (PCA),  512,  515
Profile
    analysis,  238
    method,  476
Projection
    matrix,  75
    pursuit,  505
    pursuit  regression,  508
    vector,  511

Proximity  between  objects,  387 
Proximity  measure,  386 
p-value,  269

Quadratic  discriminant  analysis,  414 
Quadratic  forms,  62 
Quadratic  response  model,  253 
Quality  of  the  representations,  339 

Randomized  discriminant  rule,  413 
Rank,  55 

Reduced  model,  103 
Rotations,  73,  374 
Row  space,  306 
Russel  and  Rao  (RR),  388 
Sampling  distributions,  142
Scatterplot  matrix,  21 
Separation  line,  21 
Similarity  of  objects,  387 
Simple  analysis  of  variance  (ANOVA),  100 
Simple  Matching,  388,  389 
Single  linkage,  395 
Singular  normal  distribution,  140 
Singular  value  decomposition  (SVD),  61,  313 
Sliced  inverse  regression,  511,  516
    algorithm,  512
Sliced  inverse  regression  II,  513,  514,  516,  517
    algorithm,  514
Solution
    nonmetric,  483
Specific  factors,  361 
Specific  variance,  362 
Spectral  decompositions,  60 
Spherical  distribution,  195 
Standardized  linear  combinations  (SLC),  320 
Statistics,  142 
Stimulus,  475 
Student’s  t  with  n,  152 
Student’s  t-distribution,  94 
Sum  of  squares,  102 
Summary  statistics,  89 
Support  vector  machines  (SVMs),  519 
Swiss  bank  data,  4 
Symmetric  matrix,  54 


T-test,  94 
Tanimoto,  388 
Three-way  tables,  267 
Total  variation,  96 
Trace,  56 

Trade-off  analysis,  476 
Transformations,  135 
Transpose,  56 
Two  factor  method,  476 


Unbiased  estimator,  208 
Uncorrelated  factors,  361 
Unexplained  variation,  96 
Unit  vector,  72 
Upper  triangular  matrix,  54 


Variance  explained  by  PCs,  333 
Varimax
    method,  374
    rotation  method,  374


Ward  clustering,  396
Wishart
    density,  192
    distribution,  191,  192


